Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit BTRFS's capability to scrub a corrupted raid5 filesystem. To test it, I first found where a file was stored, and then I tried to corrupt either the data disks (while unmounted) or the parity disk. The result showed that sometimes the kernel recomputed the parity wrongly.
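For reference, on a three-device raid5 each full stripe holds two data blocks and one parity block, and the parity is the byte-wise XOR of the two data blocks. Below is a minimal Python 3 sketch that computes the parity expected for the 128KiB test file created in step 2. It is not taken from btrfs or from the attached script, and the helper names are hypothetical. It reproduces the '\x03\x00\x03\x03\x03...' pattern seen on /dev/loop2: 'a' (0x61) XOR 'b' (0x62) is 0x03, and 'd' XOR 'd' is 0x00.

```python
# Sketch: expected raid5 parity for the 128 KiB test file (hypothetical helpers).
# Assumption: a 3-device raid5 stripe = 2 data blocks of 64 KiB + 1 parity block,
# with parity = byte-wise XOR of the two data blocks.

STRIPE = 16 * 4096  # 64 KiB per device, as in the offsets of step 3


def test_file_content():
    """Bytes written by: print('ad'+'a'*65534+'bd'+'b'*65533) (print adds '\\n')."""
    return b"ad" + b"a" * 65534 + b"bd" + b"b" * 65533 + b"\n"


def xor_parity(d0, d1):
    """Byte-wise XOR of two equally sized blocks."""
    if len(d0) != len(d1):
        raise ValueError("blocks must have the same length")
    return bytes(a ^ b for a, b in zip(d0, d1))


data = test_file_content()
first_half, second_half = data[:STRIPE], data[STRIPE:]
parity = xor_parity(first_half, second_half)

print(len(data))   # 131072
print(parity[:5])  # b'\x03\x00\x03\x03\x03'

# Any single lost block can be rebuilt by XOR-ing the other two:
assert xor_parity(parity, second_half) == first_half
```

The last line is the property scrub relies on when it repairs a corrupted data block: the missing block is the XOR of the surviving data block and the parity, which is why a wrong parity can silently turn into wrong data later.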
I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB:

    truncate -s 500M disk1.img; sudo losetup -f disk1.img
    truncate -s 500M disk2.img; sudo losetup -f disk2.img
    truncate -s 500M disk3.img; sudo losetup -f disk3.img
    sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
    sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128KiB (131072 bytes):

    python -c "print('ad'+'a'*65534+'bd'+'b'*65533)" | sudo tee mnt/out.txt
    sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to find where the stripe holding the file is located:

    /dev/loop0: offset=81788928+16*4096 (64k, second half of the file: 'bdbbbb.....)
    /dev/loop1: offset=61865984+16*4096 (64k, first half of the file: 'adaaaa.....)
    /dev/loop2: offset=61865984+16*4096 (64k, parity: '\x03\x00\x03\x03\x03.....)

4) I tried to corrupt each disk (one disk per test), and then ran a scrub; for example, for the disk /dev/loop2:

    sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
            seek=$((61865984+16*4096)) count=5
    sudo mount /dev/loop0 mnt
    sudo btrfs scrub start mnt/

5) I checked the disks at the offsets above, to verify that the data/parity is correct.

However, I found that:

1) if I corrupted the parity disk (/dev/loop2), scrub didn't find any corruption, but recomputed the parity (always correctly);
2) when I corrupted the other disks (/dev/loop[01]), btrfs was able to find the corruption. But I found two main behaviors:
   2.a) the kernel repaired the damage, but computed the wrong parity: in place of the parity, the kernel copied the data of the second disk onto the parity disk;
   2.b) the kernel repaired the damage, and rebuilt a correct parity.

I have to point out another strange thing: in dmesg I found two kinds of messages:

msg1)
    [....]
    [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
    [ 1021.366949] BTRFS: has skinny extents
    [ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
    [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    [ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

msg2)
    [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
    [ 1017.435074] BTRFS: has skinny extents
    [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    [ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
    [ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
    [ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
    [ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
    [ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
    [ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

But these seem to be UNrelated to the kernel behavior 2.a) or 2.b).

Another strangeness is that scrub sometimes reports

    ERROR: there are uncorrectable errors

and sometimes reports

    WARNING: errors detected during scrubbing, corrected

but these also seem UNrelated to the behavior 2.a) or 2.b), or to msg1 or msg2.

Enclosed you can find the script which I used to trigger the bug. I have to rerun it several times to show the problem, because it doesn't happen every time.
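Because the problem does not show up on every run, it may help to automate step 5 and classify each run. Below is a minimal Python 3 sketch. Several assumptions are not part of the original test, so they are listed here. The backing image files are read directly: losetup -f attaches them at offset 0, so image offsets equal device offsets. disk1.img, disk2.img and disk3.img back /dev/loop0, /dev/loop1 and /dev/loop2 respectively. The file is stored uncompressed. The layout is the one found with btrfs-debug-tree in step 3. The helper names (read_block, classify, check) are hypothetical. As a side check, the sketch confirms that "sector 159872" in the dmesg messages (512-byte units) is exactly the /dev/loop0 block holding the second half of out.txt.

```python
# Sketch: automate step 5 and classify the test stripe after a scrub.
# ASSUMPTIONS (not from the original test): images are read directly,
# disk1.img/disk2.img/disk3.img back /dev/loop0/loop1/loop2, the file is
# stored uncompressed, and the layout is the one found in step 3.
# Helper names are hypothetical.

STRIPE = 16 * 4096  # 64 KiB per device

# Byte offset of the test stripe on each device (from btrfs-debug-tree)
LAYOUT = {
    "loop0": 81788928 + 16 * 4096,  # second half of the file
    "loop1": 61865984 + 16 * 4096,  # first half of the file
    "loop2": 61865984 + 16 * 4096,  # parity
}

# dmesg reports "sector" in 512-byte units: sector 159872 on /dev/loop0
# is exactly the loop0 block above (the second half of out.txt).
assert 159872 * 512 == LAYOUT["loop0"]

# Original contents of the two halves of out.txt (print adds the final '\n')
FIRST = b"ad" + b"a" * 65534
SECOND = b"bd" + b"b" * 65533 + b"\n"


def read_block(path, offset, length=STRIPE):
    """Read `length` bytes at `offset` from an image file or block device."""
    with open(path, "rb") as f:
        f.seek(offset)
        block = f.read(length)
    if len(block) != length:
        raise IOError("short read from %s" % path)
    return block


def classify(first, second, parity):
    """Describe the stripe state after a scrub."""
    if first != FIRST or second != SECOND:
        return "BAD: data not repaired"
    if parity == bytes(a ^ b for a, b in zip(first, second)):
        return "ok: data and parity consistent (as in 2.b)"
    if parity in (first, second):
        return "BAD: parity is a copy of a data block (as in 2.a)"
    return "BAD: parity does not match data"


def check(paths=("disk1.img", "disk2.img", "disk3.img")):
    """paths: backing files (or devices) of /dev/loop0, loop1, loop2."""
    second = read_block(paths[0], LAYOUT["loop0"])
    first = read_block(paths[1], LAYOUT["loop1"])
    parity = read_block(paths[2], LAYOUT["loop2"])
    return classify(first, second, parity)
```

Calling check() from the directory containing the images, ideally after unmounting, returns one line per run. The parity is compared against both data blocks rather than against one specific disk, so the sketch does not need to assume which disk's data ends up on the parity device in case 2.a).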
Note that in the script the offsets and the loop device names are hard coded. You must run the script from the directory where it is located, e.g. "bash test.sh".

Br
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
test.sh
Description: Bourne shell script