Re: [PATCH] btrfs: fix log replay failure due to race with space cache rebuild

Filipe Manana Fri, 22 Jan 2021 11:51:14 -0800

On Fri, Jan 22, 2021 at 6:39 PM Josef Bacik <[email protected]> wrote:
>
> On 1/22/21 12:56 PM, [email protected] wrote:
> > From: Filipe Manana <[email protected]>
> >
> > After a sudden power failure we may end up with a space cache on disk that
> > is not valid and needs to be rebuilt from scratch.
> >
> > If that happens, during log replay when we attempt to pin an extent buffer
> > from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> > the space cache to be rebuilt through the call to:
> >
> >      btrfs_cache_block_group(cache, 1);
> >
> > That is because that only waits for the task (work queue job) that loads
> > the space cache to change the cache state from BTRFS_CACHE_FAST to any
> > other value. That is ok when the space cache on disk exists and is valid,
> > but when the cache is not valid and needs to be rebuilt, it ends up
> > returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> > at caching_thread()).
> >
> > So this means that we can end up trying to unpin a range which is not yet
> > marked as free in the block group. This results in the call to
> > btrfs_remove_free_space() to return -EINVAL to
> > btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> > as well as mounting the filesystem. More specifically the -EINVAL comes
> > from free_space_cache.c:remove_from_bitmap(), because the requested range
> > is not marked as free space (ones in the bitmap), we have the following
> > condition triggered:
> >
> > static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> > (...)
> >         if (ret < 0 || search_start != *offset)
> >              return -EINVAL;
> > (...)
> >
> > It's the "search_start != *offset" that results in the condition being
> > evaluated to true.
> >
> > When this happens we got the following in dmesg/syslog:
> >
> > [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 
> > devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
> > [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> > [72383.418536] BTRFS info (device sdb): has skinny extents
> > [72383.423846] BTRFS info (device sdb): start tree-log replay
> > [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong 
> > amount of free space
> > [72383.427686] BTRFS warning (device sdb): failed to load free space cache 
> > for block group 30408704, rebuilding it now
> > [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: 
> > errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> > [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: 
> > errno=-22 unknown (Failed to recover log tree)
> > [72383.460241] BTRFS error (device sdb): open_ctree failed
> >
> > We also mark the range for the extent buffer in the excluded extents io
> > tree. That is fine when the space cache is valid on disk and we can load
> > it, in which case it causes no problems.
> >
> > However, for the case where we need to rebuild the space cache, because it
> > is either invalid or it is missing, having the extent buffer range marked
> > in the excluded extents io tree leads to a -EINVAL failure from the call
> > to btrfs_remove_free_space(), resulting in the log replay and mount to
> > fail. This is because by having the range marked in the excluded extents
> > io tree, the caching thread ends up never adding the range of the extent
> > buffer as free space in the block group since the calls to
> > add_new_free_space(), called from load_extent_tree_free(), filter out any
> > ranges that are marked as excluded extents.
> >
> > So fix this by making sure that during log replay we wait for the caching
> > task to finish completely when we need to rebuild a space cache, and also
> > drop the need to mark the extent buffer range in the excluded extents io
> > tree, as well as clearing ranges from that tree at
> > btrfs_finish_extent_commit().
> >
> > This started to happen with some frequency on large filesystems having
> > block groups with a lot of fragmentation since the recent commit
> > e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> > fact the issue has been there for years, it was just much less likely
> > to happen.
> >
> > Signed-off-by: Filipe Manana <[email protected]>
> > ---
> >   fs/btrfs/extent-tree.c | 27 ++++++++++++++++++---------
> >   1 file changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 30b1a630dc2f..89d1b0551cf8 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2602,8 +2602,6 @@ int btrfs_pin_extent_for_log_replay(struct 
> > btrfs_trans_handle *trans,
> >       struct btrfs_block_group *cache;
> >       int ret;
> >
> > -     btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes);
> > -
> >       cache = btrfs_lookup_block_group(trans->fs_info, bytenr);
> >       if (!cache)
> >               return -EINVAL;
> > @@ -2615,6 +2613,15 @@ int btrfs_pin_extent_for_log_replay(struct 
> > btrfs_trans_handle *trans,
> >        * the pinned extents.
> >        */
> >       btrfs_cache_block_group(cache, 1);
> > +     /*
> > +      * Make sure we wait until the cache is completely built in case it is
> > +      * missing or is invalid and therefore needs to be rebuilt.
> > +      */
> > +     if (btrfs_test_opt(trans->fs_info, SPACE_CACHE)) {
> > +             ret = btrfs_wait_block_group_cache_done(cache);
> > +             if (ret)
> > +                     return ret;
> > +     }
>
> Sorry I didn't explain this well in my previous reply and on slack.
>
> If we are going to keep the btrfs_remove_free_space() below, which we have to,
> we cannot remove the btrfs_add_excluded_extent() without also unconditionally
> loading the space cache.  The 'load_cache_only' only means we'll wait for the
> space cache to be loaded in the SPACE_CACHE case, not that we won't start 
> caching.
>
> Consider the free space tree case, we'll be doing the normal caching, and 
> we'll
> either hit the case that you're trying to fix, because we attempt to remove a
> free space range that has been partially cached in the bitmap.  Or we'll hit 
> the
> case where we think we've removed the free space even though it's not been
> loaded yet.
>
> Your fix needs to remove the excluded extent parts as well as do
>
> btrfs_cache_block_group(cache, 0);
> btrfs_wait_block_group_cache_done(cache);
>
> in order to be properly safe in all cases.  Thanks,


Yes, I realized that later, I have a new version being tested right now:

https://pastebin.com/pgQ2y11a

I messed and didn't realize the wait is necessary even if there is no
space cache or tree enabled.

>
> Josef

Re: [PATCH] btrfs: fix log replay failure due to race with space cache rebuild

Reply via email to