Re: Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM

Qu Wenruo Mon, 21 Dec 2020 03:47:18 -0800



On 2020/12/21 6:08 PM, Nik. wrote:

Dear all,

the forwarded mail below came back yesterday with the error
"Diagnostic-Code: X-Postfix; TLS is required, but was not offered by
host vger.kernel.org[23.128.96.18]".

Is it really intended that your mail server does not offer TLS?


I can't help with that; I'm not a vger administrator and know nothing
about its setup. (Most if not all kernel mailing lists are hosted by vger,
so an individual list can't do much about it.)

But I can definitely answer some of your btrfs questions.


Kind regards,

Nik.

--

15.12.2020 18:40, Nik.:

Dear all,

after almost a year without problems I need again your advice about
the same computer, but this time it is (hopefully only) the root FS
that failed. I have backups of everything except a couple of files in
/etc, so nothing critical, but probably it would be interesting for
somebody to see how btrfs behaved in such a situation.

The story in short:

- the FS switched to ro mode. Initially I thought that it is due to
insufficient free space (have already had similar situations) and
deleted some old snapshots. Within half a day it happened 3 more
times, though.


Do you have a detailed report of that RO switch?
We should have that addressed upstream by now; if you still hit it, I guess
we need more investigation (unless it was caused by the memory corruption).


- so I booted in memtest86 and it gave me a lot of errors! This NAS is
9 years old and I was already looking for replacement, but it is not
easy to find 8-bay NAS for 2,5" drives...

- took the drive out from the failed system and tried to mount it on
another (healthy?) PC. I am getting:

root@ubrun:~# mount -t btrfs -o subvol=@ /dev/sdb1 /mnt/sd
mount: /mnt/sd: wrong fs type, bad option, bad superblock on
/dev/sdb1, missing codepage or helper program, or other error.
root@ubrun:~# dmesg |tail
[   50.672561] Policy zone: Normal
[  185.190764] BTRFS info (device sdb1): disk space caching is enabled
[  185.190767] BTRFS info (device sdb1): has skinny extents
[  185.199331] BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
0, flush 0, corrupt 65, gen 0
[  185.246051] BTRFS critical (device sdb1): corrupt leaf:
block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
inline ref type: 54


This is indeed a memory bitflip, and your initial kernel was not new
enough to detect it at write time.

With a new enough kernel, such corrupted metadata shouldn't even
reach disk. (Although that still means the fs will go RO.)

There are only 4 valid types for extent refs:

TREE_BLOCK_REF   176(0xb0)
EXTENT_DATA_REF  178(0xb2)
SHARED_BLOCK_REF 182(0xb6)
SHARED_DATA_REF  184(0xb8)

The invalid type is:

                  54(0x36)

It differs from SHARED_BLOCK_REF (0xb6) by 0x80, so exactly one bit was flipped.
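
The single-bit comparison can be reproduced with a short script. This is only a minimal sketch for illustration, not the kernel's tree-checker logic: the table holds the four ref types listed above, and the helper name single_bit_candidates is made up for this example. It XORs the observed value against each valid type and keeps those whose difference has exactly one bit set.

```python
# Valid inline extent ref types, copied from the list above.
VALID_REF_TYPES = {
    "TREE_BLOCK_REF": 0xB0,
    "EXTENT_DATA_REF": 0xB2,
    "SHARED_BLOCK_REF": 0xB6,
    "SHARED_DATA_REF": 0xB8,
}


def single_bit_candidates(observed):
    """Return (name, flipped_mask) for each valid type one bit away from observed.

    Hypothetical helper for illustration; not part of btrfs-progs or the kernel.
    """
    hits = []
    for name, value in VALID_REF_TYPES.items():
        diff = observed ^ value
        # Exactly one bit set: nonzero and a power of two.
        if diff != 0 and diff & (diff - 1) == 0:
            hits.append((name, diff))
    return hits


for name, mask in single_bit_candidates(54):
    print(f"{name}: flipped bit mask 0x{mask:02x}")
```

For the value 54 from the dmesg output, this prints only SHARED_BLOCK_REF with mask 0x80, matching the manual comparison.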

[  185.246055] BTRFS error (device sdb1): block=50850988032 read time
tree block corruption detected
[  185.247070] BTRFS critical (device sdb1): corrupt leaf:
block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
inline ref type: 54
[  185.247073] BTRFS error (device sdb1): block=50850988032 read time
tree block corruption detected
[  185.247093] BTRFS error (device sdb1): failed to read block groups: -5
[  185.281382] BTRFS error (device sdb1): open_ctree failed
root@ubrun:~#

How should one proceed?


Since it's caused by a bitflip and you mentioned the system has tons of
memory errors, I believe there will be tons of similar problems
scattered around your fs.

As for repair, I don't really believe btrfs-check can, or ever will be able
to, fix any bitflip, let alone the many more that are likely present.

It's better to just use your backup.

BTW, detection of extent tree bitflips was introduced in v5.4.
Next time you can at least catch the faulty hardware before it screws up
your data.

Thanks,
Qu


Kind regards

Nik.
