On Wed, Feb 6, 2019 at 8:47 PM Josef Bacik <[email protected]> wrote: > > Previously we only updated the drop_progress key if we were in the > DROP_REFERENCE stage of snapshot deletion. This is because the > UPDATE_BACKREF stage checks the flags of the blocks it's converting to > FULL_BACKREF, so if we go over a block we processed before it doesn't > matter, we just don't do anything. > > The problem is in do_walk_down() we will go ahead and drop the roots > reference to any blocks that we know we won't need to walk into. > > Given subvolume A and snapshot B. The root of B points to all of the > nodes that belong to A, so all of those nodes have a refcnt > 1. If B > did not modify those blocks it'll hit this condition in do_walk_down > > if (!wc->update_ref || > generation <= root->root_key.offset) > goto skip; > > and in "goto skip" we simply do a btrfs_free_extent() for that bytenr > that we point at. > > Now assume we modified some data in B, and then took a snapshot of B and > call it C. C points to all the nodes in B, making every node the root > of B points to have a refcnt > 1. This assumes the root level is 2 or > higher. > > We delete snapshot B, which does the above work in do_walk_down, > free'ing our ref for nodes we share with A that we didn't modify. Now > we hit a node we _did_ modify, thus we own. We need to walk down into > this node and we set wc->stage == UPDATE_BACKREF. We walk down to level > 0 which we also own because we modified data. We can't walk any further > down and thus now need to walk up and start the next part of the > deletion. Now walk_up_proc is supposed to put us back into > DROP_REFERENCE, but there's an exception to this > > if (level < wc->shared_level) > goto out; > > we are at level == 0, and our shared_level == 1. We skip out of this > one and go up to level 1. Since path->slots[1] < nritems we > path->slots[1]++ and break out of walk_up_tree to stop our transaction > and loop back around. Now in btrfs_drop_snapshot we have this snippet > > if (wc->stage == DROP_REFERENCE) { > level = wc->level; > btrfs_node_key(path->nodes[level], > &root_item->drop_progress, > path->slots[level]); > root_item->drop_level = level; > } > > our stage == UPDATE_BACKREF still, so we don't update the drop_progress > key. This is a problem because we would have done btrfs_free_extent() > for the nodes leading up to our current position. If we crash or > unmount here and go to remount we'll start over where we were before and > try to free our ref for blocks we've already freed, and thus abort() > out. > > Fix this by keeping track of the last place we dropped a reference for > our block in do_walk_down. Then if wc->stage == UPDATE_BACKREF we know > we'll start over from a place we meant to, and otherwise things continue > to work as they did before. > > I have a complicated reproducer for this problem, without this patch > we'll fail to fsck the fs when replaying the log writes log. With this > patch we can replay the whole log without any fsck or mount failures. > > Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: Filipe Manana <[email protected]> Looks good, great catch! > --- > fs/btrfs/extent-tree.c | 26 ++++++++++++++++++++------ > 1 file changed, 20 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index f40d6086c947..53fd4626660c 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8765,6 +8765,8 @@ struct walk_control { > u64 refs[BTRFS_MAX_LEVEL]; > u64 flags[BTRFS_MAX_LEVEL]; > struct btrfs_key update_progress; > + struct btrfs_key drop_progress; > + int drop_level; > int stage; > int level; > int shared_level; > @@ -9148,6 +9150,16 @@ static noinline int do_walk_down(struct > btrfs_trans_handle *trans, > ret); > } > } > + > + /* > + * We need to update the next key in our walk control so we > can > + * update the drop_progress key accordingly. We don't care if > + * find_next_key doesn't find a key because that means we're > at > + * the end and are going to clean up now. > + */ > + wc->drop_level = level; > + find_next_key(path, level, &wc->drop_progress); > + > ret = btrfs_free_extent(trans, root, bytenr, > fs_info->nodesize, > parent, root->root_key.objectid, > level - 1, 0); > @@ -9499,12 +9511,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > } > > if (wc->stage == DROP_REFERENCE) { > - level = wc->level; > - btrfs_node_key(path->nodes[level], > - &root_item->drop_progress, > - path->slots[level]); > - root_item->drop_level = level; > - } > + wc->drop_level = wc->level; > + btrfs_node_key_to_cpu(path->nodes[wc->drop_level], > + &wc->drop_progress, > + path->slots[wc->drop_level]); > + } > + btrfs_cpu_key_to_disk(&root_item->drop_progress, > + &wc->drop_progress); > + root_item->drop_level = wc->drop_level; > > BUG_ON(wc->level == 0); > if (btrfs_should_end_transaction(trans) || > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
