On Wed, Feb 6, 2019 at 8:47 PM Josef Bacik <[email protected]> wrote:
>
> Previously we only updated the drop_progress key if we were in the
> DROP_REFERENCE stage of snapshot deletion.  This is because the
> UPDATE_BACKREF stage checks the flags of the blocks it's converting to
> FULL_BACKREF, so if we go over a block we processed before it doesn't
> matter, we just don't do anything.
>
> The problem is in do_walk_down() we will go ahead and drop the roots
> reference to any blocks that we know we won't need to walk into.
>
> Given subvolume A and snapshot B.  The root of B points to all of the
> nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
> did not modify those blocks it'll hit this condition in do_walk_down
>
> if (!wc->update_ref ||
>     generation <= root->root_key.offset)
>         goto skip;
>
> and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
> that we point at.
>
> Now assume we modified some data in B, and then took a snapshot of B and
> call it C.  C points to all the nodes in B, making every node the root
> of B points to have a refcnt > 1.  This assumes the root level is 2 or
> higher.
>
> We delete snapshot B, which does the above work in do_walk_down,
> free'ing our ref for nodes we share with A that we didn't modify.  Now
> we hit a node we _did_ modify, thus we own.  We need to walk down into
> this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
> 0 which we also own because we modified data.  We can't walk any further
> down and thus now need to walk up and start the next part of the
> deletion.  Now walk_up_proc is supposed to put us back into
> DROP_REFERENCE, but there's an exception to this
>
> if (level < wc->shared_level)
>         goto out;
>
> we are at level == 0, and our shared_level == 1.  We skip out of this
> one and go up to level 1.  Since path->slots[1] < nritems we
> path->slots[1]++ and break out of walk_up_tree to stop our transaction
> and loop back around.  Now in btrfs_drop_snapshot we have this snippet
>
> if (wc->stage == DROP_REFERENCE) {
>         level = wc->level;
>         btrfs_node_key(path->nodes[level],
>                        &root_item->drop_progress,
>                        path->slots[level]);
>         root_item->drop_level = level;
> }
>
> our stage == UPDATE_BACKREF still, so we don't update the drop_progress
> key.  This is a problem because we would have done btrfs_free_extent()
> for the nodes leading up to our current position.  If we crash or
> unmount here and go to remount we'll start over where we were before and
> try to free our ref for blocks we've already freed, and thus abort()
> out.
>
> Fix this by keeping track of the last place we dropped a reference for
> our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
> we'll start over from a place we meant to, and otherwise things continue
> to work as they did before.
>
> I have a complicated reproducer for this problem, without this patch
> we'll fail to fsck the fs when replaying the log writes log.  With this
> patch we can replay the whole log without any fsck or mount failures.
>
> Signed-off-by: Josef Bacik <[email protected]>

Reviewed-by: Filipe Manana <[email protected]>

Looks good, great catch!

> ---
>  fs/btrfs/extent-tree.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f40d6086c947..53fd4626660c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8765,6 +8765,8 @@ struct walk_control {
>         u64 refs[BTRFS_MAX_LEVEL];
>         u64 flags[BTRFS_MAX_LEVEL];
>         struct btrfs_key update_progress;
> +       struct btrfs_key drop_progress;
> +       int drop_level;
>         int stage;
>         int level;
>         int shared_level;
> @@ -9148,6 +9150,16 @@ static noinline int do_walk_down(struct 
> btrfs_trans_handle *trans,
>                                              ret);
>                         }
>                 }
> +
> +               /*
> +                * We need to update the next key in our walk control so we 
> can
> +                * update the drop_progress key accordingly.  We don't care if
> +                * find_next_key doesn't find a key because that means we're 
> at
> +                * the end and are going to clean up now.
> +                */
> +               wc->drop_level = level;
> +               find_next_key(path, level, &wc->drop_progress);
> +
>                 ret = btrfs_free_extent(trans, root, bytenr, 
> fs_info->nodesize,
>                                         parent, root->root_key.objectid,
>                                         level - 1, 0);
> @@ -9499,12 +9511,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>                 }
>
>                 if (wc->stage == DROP_REFERENCE) {
> -                       level = wc->level;
> -                       btrfs_node_key(path->nodes[level],
> -                                      &root_item->drop_progress,
> -                                      path->slots[level]);
> -                       root_item->drop_level = level;
> -               }
> +                       wc->drop_level = wc->level;
> +                       btrfs_node_key_to_cpu(path->nodes[wc->drop_level],
> +                                             &wc->drop_progress,
> +                                             path->slots[wc->drop_level]);
> +               }
> +               btrfs_cpu_key_to_disk(&root_item->drop_progress,
> +                                     &wc->drop_progress);
> +               root_item->drop_level = wc->drop_level;
>
>                 BUG_ON(wc->level == 0);
>                 if (btrfs_should_end_transaction(trans) ||
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

Reply via email to