Re: Scrub aborts due to corrupt leaf

Larkin Lowrey Wed, 10 Oct 2018 08:45:02 -0700

On 9/11/2018 11:23 AM, Larkin Lowrey wrote:

On 8/29/2018 1:32 AM, Qu Wenruo wrote:
On 2018/8/28 9:56 PM, Chris Murphy wrote:
On Tue, Aug 28, 2018 at 7:42 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
On 2018/8/28 9:29 PM, Larkin Lowrey wrote:
On 8/27/2018 10:12 PM, Larkin Lowrey wrote:
On 8/27/2018 12:46 AM, Qu Wenruo wrote:
The system uses ECC memory and edac-util has not reported any errors.
However, I will run a memtest anyway.
So it should not be the memory problem.

BTW, what's the current generation of the fs?

# btrfs inspect dump-super <device> | grep generation
The corrupted leaf has generation 2862, so I'm not sure how recently the
corruption happened.
generation              358392
chunk_root_generation   357256
cache_generation        358392
uuid_tree_generation    358392
dev_item.generation     0
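To put the generation numbers above side by side with the corrupted leaf's generation, here is a minimal sketch in Python. It assumes the dump-super output has been saved as plain text in the two-column form shown here; the helper name parse_generations and the variable names are illustrative and not part of btrfs-progs.

```python
def parse_generations(dump_super_text):
    """Collect every '<name> <integer>' line whose name contains 'generation'."""
    fields = {}
    for line in dump_super_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and "generation" in parts[0] and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


# Output quoted in this thread.
sample = """\
generation              358392
chunk_root_generation   357256
cache_generation        358392
uuid_tree_generation    358392
dev_item.generation     0
"""

gens = parse_generations(sample)
leaf_generation = 2862  # generation of the corrupted leaf, as reported above

# Number of generations (transactions) between the leaf and the current super block.
gap = gens["generation"] - leaf_generation
print(gap)
```

A large gap only says the leaf was last written long ago in transaction terms; it does not by itself say when the corruption happened.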

I don't recall the last time I ran a scrub but I doubt it has been
more than a year.
I am running 'btrfs check --init-csum-tree' now. Hopefully that clears
everything up.
No such luck:

Creating a new CRC tree
Checking filesystem on /dev/Cached/Backups
UUID: acff5096-1128-4b24-a15e-4ba04261edc3
Reinitialize checksum tree
csum result is 0 for block 2412149436416
extent-tree.c:2764: alloc_tree_block: BUG_ON `ret` triggered, value -28
It's ENOSPC, meaning btrfs can't find enough space for the new csum tree
blocks.
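As a quick way to confirm the reading of "value -28", a small Python sketch that maps the negated return value to its errno name. It assumes Linux errno numbering, where 28 is ENOSPC; the variable names are illustrative.

```python
import errno
import os

ret = -28  # value printed by the BUG_ON message above

# Kernel-style return codes are negated errno values, so flip the sign first.
name = errno.errorcode[-ret]
message = os.strerror(-ret)

print(name, "-", message)
```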
Seems bogus, there's >4TiB unallocated.
What a shame.
Btrfs won't try to allocate new chunk if we're allocating new tree
blocks for metadata trees (extent, csum, etc).

One quick (and dirty) way to avoid such limitation is to use the
following patch
<<patch removed>>
No luck.

# ./btrfs check --init-csum-tree /dev/Cached/Backups
Creating a new CRC tree
Opening filesystem to check...
Checking filesystem on /dev/Cached/Backups
UUID: acff5096-1128-4b24-a15e-4ba04261edc3
Reinitialize checksum tree
Segmentation fault (core dumped)
btrfs[16575]: segfault at 7ffc4f74ef60 ip 000000000040d4c3 sp 00007ffc4f74ef50 error 6 in btrfs[400000+bf000]
# ./btrfs --version
btrfs-progs v4.17.1

I cloned btrfs-progs from git and applied your patch.
BTW, I've been having tons of trouble with two hosts after updating
from kernel 4.17.12 to 4.17.14 and beyond. The fs will become
unresponsive and all processes will end up stuck waiting on io. The
system will end up totally idle but unable to perform any io on the
filesystem. So far things have been stable after reverting back to
4.17.12. It looks like there was a btrfs change in 4.17.13. Could that
be related to this csum tree corruption?

About once a week, or so, I'm running into the above situation where FS
seems to deadlock. All IO to the FS blocks, there is no IO activity at
all. I have to hard reboot the system to recover. There are no error
indications except for the following which occurs well before the FS
freezes up:

BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of free space
BTRFS warning (device dm-3): failed to load free space cache for block group 78691883286528, rebuilding it now


Do I have any options other than nuking the FS and starting over?

--Larkin
