A tree-based integrity verification is important for all data, but it
is missing.

     Fix:
     This RFC patch proposes to read the data from the same disk from
     which the metadata was read.
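
A minimal sketch of the idea, with hypothetical names (the actual RFC
patch is organized differently): remember which mirror the metadata eb
was read from, and pin the data read to the same copy by passing that
value as mirror_num when the block is mapped.

    /*
     * Hypothetical helper: 'metadata_mirror' would be saved from
     * eb->read_mirror after a successful tree block read.  A
     * mirror_num > 0 pins the mapping to that copy; 0 lets the
     * chunk layer pick any mirror.
     */
    static int map_data_read_same_disk(struct btrfs_fs_info *fs_info,
                                       u64 logical, u64 *length,
                                       struct btrfs_bio **bbio,
                                       int metadata_mirror)
    {
            return btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
                                   length, bbio, metadata_mirror);
    }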

The obvious problem I found is that the idea only works for RAID1/10.

For striped profiles it makes no sense, and may even have a worse
chance of getting stale data.


To me, the idea of using a possibly better mirror makes some sense, but
it is very profile-limited.

  Yep. This problem and fix apply only to mirror-based profiles
  such as raid1/raid10.

But the current implementation lacks such a check.

Furthermore, data and metadata can lie in different chunks and have
different chunk types.

 Right. The current tests for this RFC were only for raid1.

 But the final patch can fix that.

 In fact, the current patch works for all cases except when a replace
 is running with a mix of metadata:raid1 and data:raid56.

 We need some cleanups in mirror_num; basically, we need to bring it
 under a #define and handle it accordingly in __btrfs_map_block(), as
 sketched below.
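
 Something like the following (the constant and helper names are
 hypothetical; today these are bare 0/1/2 literals scattered around
 the read paths):

     /* 0 = let the chunk layer pick a stripe, >= 1 = that copy. */
     #define BTRFS_MIRROR_NUM_ANY     0
     #define BTRFS_MIRROR_NUM_FIRST   1

     /*
      * Inside __btrfs_map_block(), the raid1/raid10 handling would
      * then read roughly like this:
      */
     if (mirror_num > BTRFS_MIRROR_NUM_ANY)
             stripe_index = mirror_num - 1;        /* caller pinned a copy */
     else
             stripe_index = pick_live_mirror(map); /* hypothetical helper */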

Another idea this inspires is to make it more generic, so that a
bad/stale device gets a lower priority.

  When it comes to reading junk data, it's not about priority, it's
  about elimination. When the problem is only a few blocks, I am
  against marking the whole disk as bad.

Although it suffers from the same problem I described.

To put it briefly: the use case looks very limited.

  It applies to raid1/raid10 with nodatacow (which implies nodatasum).
  In my understanding that's not rare.

  Any comments on the fix offered here?

On the implementation side: is eb->read_mirror reliable?

E.g., what if the data and the eb are in different chunks, and the
stale copy is in the eb's chunk but not in the data chunk?


 The eb and regular data are indeed always in different chunks. But an
 eb can never be stale, because the parent transid is verified against
 the eb that was read. However, we have no such check for the data
 (this is the core of the issue here), and so we return the junk data
 silently.
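
 For contrast, a condensed sketch of the metadata-side check referred
 to above (modeled on verify_parent_transid() in fs/btrfs/disk-io.c):
 the parent node records the transid it expects the child eb to carry,
 so a stale mirror copy is caught at read time and another mirror is
 tried. Nothing equivalent exists for nodatasum data.

     static int check_parent_transid(struct extent_buffer *eb,
                                     u64 parent_transid)
     {
             /* parent_transid == 0 means no expectation was recorded. */
             if (!parent_transid ||
                 btrfs_header_generation(eb) == parent_transid)
                     return 0;   /* eb matches what the parent expects */

             /* Stale copy: the caller retries from the next mirror. */
             return -EIO;
     }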

 Also, any idea why the generation number for the extent data is not
 incremented [2] when -o nodatacow and dd's notrunc option are used?
 Is it a bug? The dump-tree output was taken with the script below [1].
 (The corruption is seen whether or not the generation number is
 incremented, but as another way to fix the corruption we could verify
 the inode's EXTENT_DATA generation from the same disk from which the
 data is read; a sketch of that check follows the dump-tree output
 below.)

[1]
 umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
 mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
 echo 1st write: && \
 dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
 btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
 echo --- && \
 btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
 echo 2nd write: && \
 dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
 btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
 echo --- && \
 btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"


1st write:
        item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
                generation 6 transid 6 size 4096 nbytes 4096
                block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
                sequence 1 flags 0x3(NODATASUM|NODATACOW)
                atime 1553058460.163985452 (2019-03-20 13:07:40)
                ctime 1553058460.163985452 (2019-03-20 13:07:40)
                mtime 1553058460.163985452 (2019-03-20 13:07:40)
                otime 1553058460.163985452 (2019-03-20 13:07:40)
---
        item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
                generation 6 type 1 (regular)
                extent data disk byte 13631488 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
2nd write:
        item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
                generation 6 transid 7 size 4096 nbytes 4096
                block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
                sequence 2 flags 0x3(NODATASUM|NODATACOW)
                atime 1553058460.163985452 (2019-03-20 13:07:40)
                ctime 1553058460.189985450 (2019-03-20 13:07:40)
                mtime 1553058460.189985450 (2019-03-20 13:07:40)
                otime 1553058460.163985452 (2019-03-20 13:07:40)
---
        item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
                generation 6 type 1 (regular)   <----- [2]
                extent data disk byte 13631488 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
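
A sketch of the alternative fix mentioned above (hypothetical helper;
it assumes the EXTENT_DATA generation were reliably bumped on every
overwrite, which [2] shows is currently not the case): after reading
the data, look up the inode's EXTENT_DATA item on the same disk the
data came from and compare its generation against the one seen on the
up-to-date disk.

    static int verify_extent_data_gen(struct extent_buffer *leaf,
                                      int slot, u64 expected_gen)
    {
            struct btrfs_file_extent_item *fi;

            fi = btrfs_item_ptr(leaf, slot,
                                struct btrfs_file_extent_item);
            /*
             * A stale copy would carry an older generation than the
             * one recorded on the disk the metadata was read from.
             */
            if (btrfs_file_extent_generation(leaf, fi) != expected_gen)
                    return -EIO;
            return 0;
    }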


Thanks, Anand
