On 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:
Hi all,

Following the thread "Adventures in btrfs raid5 disk recovery", I investigated 
a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first 
found where a file was stored, and then I tried to corrupt the data disks (while 
unmounted) or the parity disk.
The result showed that sometimes the kernel recomputed the parity wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
Both showed the same behavior.

The test was performed as described below:

1) I created a 1500MB (3 x 500MB) filesystem in raid5 mode (for data and metadata):

        truncate -s 500M disk1.img; losetup -f disk1.img
        truncate -s 500M disk2.img; losetup -f disk2.img
        truncate -s 500M disk3.img; losetup -f disk3.img
        sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
        sudo mount /dev/loop0 mnt/
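
To double-check that both data and metadata really ended up as raid5, something 
like this should do (exact output wording varies between btrfs-progs versions):

        sudo btrfs fi df mnt/        # Data and Metadata should both be listed as RAID5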

2) I created a file with a length of 128KiB:

        python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
        sudo umount mnt/
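
As a quick sanity check (run before the umount above), the file should be exactly 
131072 bytes, i.e. two 64KiB halves, one per data device:

        stat -c %s mnt/out.txt       # expect 131072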

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and was able to 
find where the file's stripes are located (a read-back check is sketched right 
after the list):

/dev/loop0: offset=81788928+16*4096        (64k, second half of the file: 'bdbbbb...')
/dev/loop1: offset=61865984+16*4096        (64k, first half of the file: 'adaaaa...')
/dev/loop2: offset=61865984+16*4096        (64k, parity: '\x03\x00\x03\x03\x03...')
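
The read-back check mentioned above, assuming the same loop devices and offsets 
(hexdump is only used for display, od -c works just as well):

        sudo dd if=/dev/loop0 bs=1 skip=$((81788928+16*4096)) count=32 2>/dev/null | hexdump -C   # expect 'bdbb...'
        sudo dd if=/dev/loop1 bs=1 skip=$((61865984+16*4096)) count=32 2>/dev/null | hexdump -C   # expect 'adaa...'
        sudo dd if=/dev/loop2 bs=1 skip=$((61865984+16*4096)) count=32 2>/dev/null | hexdump -C   # expect 03 00 03 03 ...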

4) I tried to corrupt each disk (one disk per test), and then ran a scrub.

For example, for the disk /dev/loop2:
        sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
                        seek=$((61865984+16*4096)) count=5
        sudo mount /dev/loop0 mnt
        sudo btrfs scrub start mnt/.
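
Once the scrub has finished, the per-device results can be inspected with standard 
btrfs-progs commands, e.g.:

        sudo btrfs scrub status -d mnt       # per-device scrub statistics
        sudo btrfs device stats mnt          # cumulative per-device error counters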

I reproduced your problem and found that it seems to be a race.

The problem I found is mostly the same as yours, but with some more details:

The script is like:
-------
#!/bin/bash

dev1=/dev/vdb6
dev2=/dev/vdb7
dev3=/dev/vdb8
mnt=/mnt/test

umount $mnt
# First full stripe layout is
# dev1: Parity, dev2: Data Stripe1 dev3: Data Stripe2
mkfs.btrfs $dev1 $dev2 $dev3 -f -m raid5 -d raid5
mount $dev1 $mnt -o nospace_cache
xfs_io -f -c "pwrite 0 128k" $mnt/file1
sync
umount $mnt

dmesg -C
# destroy data stripe 1 (on dev2)
dd if=/dev/urandom of=$dev2 bs=1 count=64k seek=546308096


# btrfs-progs scrub
# Newly introduced function, offline scrub, quite handy here
btrfs check --scrub $dev1

# kernel scrub
mount $dev1 $mnt -o nospace_cache
btrfs scrub start -B $mnt

# Feel free to umount and call offline scrub
# umount $mnt
# btrfs check --scrub $dev1
-------

The result falls into one of the following types:

1) Kernel scrub reports 16 recoverable csum errors
   Same as offline scrub. All correct. (64KiB corrupted = 16 pages of
   4KiB, hence 16 csum errors.)

2) Kernel scrub reports 17 or more recoverable csum errors
   Even though we only corrupted 16 pages, the extra csum errors come
   out of almost nowhere.

3) Kernel scrub reports some unrecoverable csum errors
   Totally insane, but quite low probability.

I dug a little further into case 2) and found:
a) The kernel is scrubbing the correct range
   So the extra csum errors are not caused by checking the wrong
   logical bytenr range.

b) The kernel is scrubbing some pages twice
   That's the cause of the problem.


And unlike what most of us assume, scrubbing the full fs is in fact a very racy operation.
A full-fs scrub is split into N scrubbing ioctls, one per device.

So for the above script, the kernel is running *3* scrub jobs in parallel.
For other profiles this may not be a problem, but for RAID5/6 the race can happen easily, like:

Scrub dev1(P)           |  Scrub dev2(D1)        | Scrub dev3(D2)
---------------------------------------------------------------
Read out full stripe    | Read out D1            | Read out D2
                        | Check Csum for D1      | Check Csum for D2
                        | Csum mismatch (err++)  | Csum matches
Cal parity              | Read out full stripe   |
Parity mismatch         | Do recovery            |
Check full stripe       |
D1 csum mismatch (err++)|

So a csum mismatch can be counted twice.

And since scrubbing a corrupted data stripe can easily race with scrubbing the parity, if the timing is bad enough it can even lead to unrecoverable csum errors.

On the other hand, if you scrub only the damaged device, no race will happen, so cases 2) and 3) just disappear.

Would you please try scrubbing only one device at a time?
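
For example, something along these lines (using the device variables from the 
script above; -B makes each scrub wait for completion, so only one device is 
scrubbed at a time, unlike the full-fs scrub which effectively runs one job per 
device in parallel):

        # filesystem from the script above still mounted at $mnt
        for dev in $dev1 $dev2 $dev3; do
                btrfs scrub start -B $dev
        done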


5) I checked the disks at the offsets above to verify that the data/parity was 
correct.

You could try the new offline scrub; it can save you a lot of time finding data/parity corruption.
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

And of course, it is more reliable than the kernel scrub (single thread, no extra IO, no race) :)
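
For the test above that would be something like (filesystem unmounted, btrfs 
built from the fsck_scrub branch linked above):

        sudo umount mnt/
        sudo btrfs check --scrub /dev/loop0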

Thanks,
Qu


However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't report any corruption, 
but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]), btrfs was able to find the 
corruption. But I observed two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity: where the 
parity should be, the kernel copied the data of the second disk onto the parity 
disk (a check for this is sketched right after this list);

2.b) the kernel repaired the damage and rebuilt the correct parity.
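
One way to tell 2.a) and 2.b) apart, as mentioned in 2.a) above: dump the three 
64KiB blocks at the offsets from step 3) and compare them. If the "parity" block 
is byte-for-byte identical to one of the data halves, the rebuilt parity is wrong 
(a correct parity block starts with 03 00 03 03 ..., see step 3). A sketch, with 
the offsets hard coded as in step 3 and temporary files under /tmp:

        sudo dd if=/dev/loop0 of=/tmp/d2.bin bs=1 skip=$((81788928+16*4096)) count=65536 2>/dev/null
        sudo dd if=/dev/loop1 of=/tmp/d1.bin bs=1 skip=$((61865984+16*4096)) count=65536 2>/dev/null
        sudo dd if=/dev/loop2 of=/tmp/p.bin  bs=1 skip=$((61865984+16*4096)) count=65536 2>/dev/null
        cmp -s /tmp/p.bin /tmp/d1.bin && echo "parity == first data half: wrong parity (2.a)"
        cmp -s /tmp/p.bin /tmp/d2.bin && echo "parity == second data half: wrong parity (2.a)"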

I have to point out another strange thing: in dmesg I found two kinds of 
messages:

msg1)
        [....]
        [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
        [ 1021.366949] BTRFS: has skinny extents
        [ 1021.399208] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
        [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
        [ 1021.399291] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0

msg2)
        [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
        [ 1017.435074] BTRFS: has skinny extents
        [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
        [ 1017.463403] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 
65536, length 4096, links 1 (path: out.txt)
        [ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
        [ 1017.463467] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
        [ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 3, gen 0
        [ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) 
error at logical 142802944 on dev /dev/loop0
        [ 1017.463535] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0


but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)

Another strange thing is that scrub sometimes reports
 ERROR: there are uncorrectable errors
and sometimes reports
 WARNING: errors detected during scrubbing, corrected

but these also seem to be UNrelated to behavior 2.a) or 2.b), or to msg1 or msg2


Enclosed you can find the script which I used to trigger the bug. I have to rerun it 
several times to show the problem because it doesn't happen every time. Note that 
the offsets and the loop device names are hard coded. You must run the script from 
the directory where it resides, e.g. "bash test.sh".

Br
G.Baroncelli





