On Mon, Mar 15, 2021 at 7:31 PM David Sterba <dste...@suse.cz> wrote: > > On Thu, Mar 11, 2021 at 02:31:06PM +0000, fdman...@kernel.org wrote: > > From: Filipe Manana <fdman...@suse.com> > > > > When freeing a tree block we may end up adding its extent back to the > > free space cache/tree, as long as there are no more references for it, > > it was created in the current transaction and writeback for it never > > happened. This is generally fine, however when we have tree mod log > > operations it can result in inconsistent versions of a btree after > > unwinding extent buffers with the recorded tree mod log operations. > > > > This is because: > > > > * We only log operations for nodes (adding and removing key/pointers), > > for leaves we don't do anything; > > > > * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation > > for a node that points to a leaf that was deleted; > > > > * Before we apply the logged operation to unwind a node, we can have > > that leaf's extent allocated again, either as a node or as a leaf, and > > possibly for another btree. This is possible if the leaf was created in > > the current transaction and writeback for it never started, in which > > case btrfs_free_tree_block() returns its extent back to the free space > > cache/tree; > > > > * Then, before applying the tree mod log operation, some task allocates > > the metadata extent just freed before, and uses it either as a leaf or > > as a node for some btree (can be the same or another one, it does not > > matter); > > > > * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now > > get the target node with an item pointing to the metadata extent that > > now has content different from what it had before the leaf was deleted. > > It might now belong to a different btree and be a node and not a leaf > > anymore. > > > > As a consequence, the results of searches after the unwinding can be > > unpredictable and produce unexpected results. > > > > So make sure we pin extent buffers corresponding to leaves when there > > are tree mod log users. > > > > Signed-off-by: Filipe Manana <fdman...@suse.com> > > --- > > fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++- > > 1 file changed, 22 insertions(+), 1 deletion(-) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 5e228d6ad63f..2482b26b1971 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -3310,6 +3310,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle > > *trans, > > > > if (last_ref && btrfs_header_generation(buf) == trans->transid) { > > struct btrfs_block_group *cache; > > + bool must_pin = false; > > > > if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { > > ret = check_ref_cleanup(trans, buf->start); > > @@ -3327,7 +3328,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle > > *trans, > > goto out; > > } > > > > - if (btrfs_is_zoned(fs_info)) { > > + /* > > + * If this is a leaf and there are tree mod log users, we may > > + * have recorded mod log operations that point to this leaf. > > + * So we must make sure no one reuses this leaf's extent > > before > > + * mod log operations are applied to a node, otherwise after > > + * rewinding a node using the mod log operations we get an > > + * inconsistent btree, as the leaf's extent may now be used as > > + * a node or leaf for another different btree. > > + * We are safe from races here because at this point no other > > + * node or root points to this extent buffer, so if after this > > + * check a new tree mod log user joins, it will not be able to > > + * find a node pointing to this leaf and record operations > > that > > + * point to this leaf. > > + */ > > + if (btrfs_header_level(buf) == 0) { > > + read_lock(&fs_info->tree_mod_log_lock); > > + must_pin = !list_empty(&fs_info->tree_mod_seq_list); > > + read_unlock(&fs_info->tree_mod_log_lock); > > + } > > + > > + if (must_pin || btrfs_is_zoned(fs_info)) { > > btrfs_redirty_list_add(trans->transaction, buf); > > pin_down_extent(trans, cache, buf->start, buf->len, > > 1); > > btrfs_put_block_group(cache); > > This has been added in d3575156f662 ("btrfs: zoned: redirty released > extent buffers") 5.12-rc1, so it is a regression but otherwise it sounds > like it's not related only to zoned mode. I'm not sure if this is > relevant for older stable trees because of missing > btrfs_redirty_list_add, possibly with some tweaks. Please let me know, > thanks.
It's not related to zoned filesystems at all. I just happened to reuse that if branch for the zoned case because it does the same thing I needed to do, and the call to btrfs_redirty_list_add() does nothing for the non-zoned case, so it's safe. I.e., it's the same as adding the following instead: if (must_pin) { pin_down_extent(trans, cache, buf->start, buf->len, 1); btrfs_put_block_group(cache); goto out; } I opted for the shorter version by ORing the two cases. But yes, for pre 5.12-rc1 the patch will not apply and the above code would be needed. Thanks.