On 3/20/19 2:27 PM, Qu Wenruo wrote:


On 2019/3/20 1:47 PM, Anand Jain wrote:


A tree-based integrity verification is important for all data, but it
is missing.
  Fix:
  This RFC patch proposes to read the data from the same disk from
  which the metadata was read.

The obvious problem I found is that the idea only works for RAID1/10.

For striped profiles it makes no sense, or even has a worse chance of
hitting stale data.


To me, the idea of using a possibly better mirror makes some sense, but
it is very profile-limited.

   Yep. This problem and fix are only for the mirror-based profiles
   such as raid1/raid10.
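
   To make that restriction concrete: it would boil down to testing the
   chunk's profile bits before ever reusing the metadata mirror for a
   data read. The block-group flags below are the real btrfs ones, but
   the helper itself is only an illustrative sketch, not code from the
   patch:

   /*
    * Sketch only: the "read data from the metadata disk" idea is
    * meaningful only for mirror-based profiles; everything else should
    * fall back to the normal mirror selection.
    */
   static bool chunk_is_mirror_based(u64 chunk_type)
   {
           return chunk_type & (BTRFS_BLOCK_GROUP_RAID1 |
                                BTRFS_BLOCK_GROUP_RAID10);
   }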

Then the current implementation lacks such a check.

Furthermore, data and metadata can lie in different chunks and have
different chunk types.

  Right. Current tests for this RFC were only for raid1.

  But the final patch can fix that.

  In fact the current patch works for all cases except when a replace
  is running with a mix of metadata:raid1 and data:raid56.

  We need some cleanups around mirror_num; basically we need to bring
  its special values under #defines and handle them accordingly in
  __btrfs_map_block().
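
  For example, something along these lines (the names are purely
  hypothetical, just to show the intent of pulling the magic mirror_num
  values under #defines):

  /* Hypothetical names; today the raw integers are passed around. */
  #define BTRFS_MIRROR_ANY     0  /* let __btrfs_map_block() pick a stripe */
  #define BTRFS_MIRROR_FIRST   1  /* first copy of the chunk               */
  #define BTRFS_MIRROR_SECOND  2  /* second copy of the chunk              */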


Wait a minute.

There is a hidden pitfall from the very beginning.

Consider such chunk layout:
Chunk A Type DATA|RAID1
Stripe 1: Dev 1
Stripe 2: Dev 2

Chunk B Type METADATA|RAID1
Stripe 1: Dev 2
Stripe 2: Dev 1

 Right, this fix can't solve this case. So that option is out. The
 other choices I had were:

 - Quarantine the whole disk by comparing the dev1 and dev2 generation
   numbers in the superblock during mount (a rough sketch of this one
   follows below), or
 - Quarantine whole chunk(s) by comparing the dev1 and dev2 generation
   numbers in the chunk tree during mount, or
 - Always read the eb from both dev1 and dev2; if that fails, fail its
   corresponding data block as well, if any, or
 - During mount, compare the dev1 and dev2 generation numbers and
   record the range of generation numbers missing on the disk. But this
   needs on-disk format changes. (And nodatacow, notrunc regular data
   extent writes must change the gen#.) or
 - Similar to the btree eb read pages, nest the extent data read as
   well. (But nodatacow, notrunc regular data extent writes must change
   the gen#.)
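
 As a rough sketch of the first option above (the mount-time superblock
 generation comparison): the struct and list fields below are the
 existing btrfs ones, but the helper and the stale_reads flag are made
 up for illustration only.

 /*
  * Hypothetical mount-time pass: any device whose superblock generation
  * lags behind the newest one gets marked, so that mirrored reads
  * prefer the other copy.
  */
 static void btrfs_quarantine_stale_devices(struct btrfs_fs_devices *fs_devices)
 {
         struct btrfs_device *device;
         u64 latest_gen = 0;

         list_for_each_entry(device, &fs_devices->devices, dev_list)
                 latest_gen = max(latest_gen, device->generation);

         list_for_each_entry(device, &fs_devices->devices, dev_list) {
                 if (device->generation < latest_gen)
                         device->stale_reads = true;  /* made-up flag */
         }
 }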

Thanks, Anand


Then when we find stale metadata in chunk B mirror 1, caused by dev 2,
your patch considers device 2 stale and tries to use mirror num 2 (the
mirror the good eb came from) to read from the data chunk.

However, in the data chunk mirror num 2 still means device 2, and we
get stale data.

So the eb->mirror_num can still map to a bad/stale device, due to the
flexibility provided by btrfs's per-chunk mapping.

Thanks,
Qu


Another idea this inspires is to make it more generic, so that a
bad/stale device gets a lower priority.

   When it comes to reading junk data, it's not about the priority, it's
   about eliminating it. When the problem is only a few blocks, I am
   against marking the whole disk as bad.

Although it suffers from the same problem I described.

To make the point short, the use case looks very limited.

   It applies to raid1/raid10 with nodatacow (which implies nodatasum).
   In my understanding that's not rare.

   Any comments on the fix offered here?

The implementation question is: is eb->read_mirror reliable?

E.g. what if the data and the eb are in different chunks, and the
staleness happens in the eb's chunk but not in the data chunk?


  The eb and regular data are indeed always in different chunks. But the
  eb can never be stale, as there is a parent transid that is verified
  against the eb that was read. However, we do not have such a check for
  the data (this is the core of the issue here), and so we return the
  junk data silently.
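
  For reference, the safety net the metadata path has is roughly of this
  shape (a simplified sketch only, not the exact upstream helper; the
  function name is made up): the eb's header generation is compared
  against the transid recorded in the parent node, and a mismatch forces
  a retry from the other mirror.

  /*
   * Simplified sketch of the parent-transid idea used for metadata
   * reads; the helper name is only for illustration.
   */
  static int check_parent_transid(struct extent_buffer *eb, u64 parent_transid)
  {
          if (!parent_transid)          /* nothing to verify against */
                  return 0;
          if (btrfs_header_generation(eb) == parent_transid)
                  return 0;
          return -EUCLEAN;              /* stale copy, try the other mirror */
  }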

  Also, any idea why the generation number for the extent data is not
  incremented [2] when the -o nodatacow and notrunc options are used? Is
  it a bug? The dump-tree output was taken with the script below [1].
  (This corruption is seen whether or not the generation number is
  incremented, but as another way to fix the corruption we could verify
  the inode EXTENT_DATA generation against the same disk from which the
  data is read.)

[1]
  umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
  mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
  echo 1st write: && \
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
  echo --- && \
  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
  echo 2nd write: && \
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
  echo --- && \
  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"


1st write:
     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
         generation 6 transid 6 size 4096 nbytes 4096
         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
         sequence 1 flags 0x3(NODATASUM|NODATACOW)
         atime 1553058460.163985452 (2019-03-20 13:07:40)
         ctime 1553058460.163985452 (2019-03-20 13:07:40)
         mtime 1553058460.163985452 (2019-03-20 13:07:40)
         otime 1553058460.163985452 (2019-03-20 13:07:40)
---
     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
         generation 6 type 1 (regular)
         extent data disk byte 13631488 nr 4096
         extent data offset 0 nr 4096 ram 4096
         extent compression 0 (none)
2nd write:
     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
         generation 6 transid 7 size 4096 nbytes 4096
         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
         sequence 2 flags 0x3(NODATASUM|NODATACOW)
         atime 1553058460.163985452 (2019-03-20 13:07:40)
         ctime 1553058460.189985450 (2019-03-20 13:07:40)
         mtime 1553058460.189985450 (2019-03-20 13:07:40)
         otime 1553058460.163985452 (2019-03-20 13:07:40)
---
     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
         generation 6 type 1 (regular)   <----- [2]
         extent data disk byte 13631488 nr 4096
         extent data offset 0 nr 4096 ram 4096
         extent compression 0 (none)


Thanks, Anand
