On 2018年05月16日 01:37, Liu Bo wrote:
> If a btree block, aka. extent buffer, is not available in the extent
> buffer cache, it'll be read out from the disk instead, i.e.
> 
> btrfs_search_slot()
>   read_block_for_search()  # hold parent and its lock, go to read child
>     btrfs_release_path()
>     read_tree_block()  # read child
> 
> Unfortunately, the parent lock got released before reading child, so
> commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> used 0 as parent transid to read the child block.  It forces
> read_tree_block() not to check if parent transid is different with the
> generation id of the child that it reads out from disk.
> 
> A simple PoC is included in btrfs/124,
> 
> 0. A two-disk raid1 btrfs,
> 
> 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> 
> 2. Mount this filesystem and put it in use, after a while, device tree's
>    root got COW but block A hasn't been allocated/overwritten yet.
> 
> 3. Umount it and reload the btrfs module to remove both disks from the
>    global @fs_devices list.
> 
> 4. mount -odegraded dev1 and write some data, so now block A is allocated
>    to be a leaf in checksum tree.  Note that only dev1 has the latest
>    metadata of this filesystem.
> 
> 5. Umount it and mount it again normally (with both disks), since raid1
>    can pick up one disk by the writer task's pid, if btrfs_search_slot()
>    needs to read block A, dev2 which does NOT have the latest metadata
>    might be read for block A, then we got a stale block A.
> 
> 6. As parent transid is not checked, block A is marked as uptodate and
>    put into the extent buffer cache, so the future search won't bother
>    to read disk again, which means it'll make changes on this stale
>    one and make it dirty and flush it onto disk.
> 
> To avoid the problem, parent transid needs to be passed to
> read_tree_block().
> 
> In order to get a valid parent transid, we need to hold the parent's
> lock until finish reading child.

Thanks for the detailed explanation.

It explains the first_key check error reported, thanks a lot!

Reviewed-by: Qu Wenruo <w...@suse.com>

Thanks,
Qu

> 
> Signed-off-by: Liu Bo <bo....@linux.alibaba.com>
> ---
>  fs/btrfs/ctree.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 3fd44835b386..b3f6f300e492 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2436,10 +2436,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
> *path, int level)
>       if (p->reada != READA_NONE)
>               reada_for_search(fs_info, p, level, slot, key->objectid);
>  
> -     btrfs_release_path(p);
> -
>       ret = -EAGAIN;
> -     tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
> +     tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
>                             &first_key);
>       if (!IS_ERR(tmp)) {
>               /*
> @@ -2454,6 +2452,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
> *path, int level)
>       } else {
>               ret = PTR_ERR(tmp);
>       }
> +
> +     btrfs_release_path(p);
>       return ret;
>  }
>  
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to