On Fri, Jan 22, 2021 at 07:07:45PM +0000, [email protected] wrote:
> From: Filipe Manana <[email protected]>
> 
> After a sudden power failure we may end up with a space cache on disk that
> is not valid and needs to be rebuilt from scratch.
> 
> If that happens, during log replay when we attempt to pin an extent buffer
> from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
> the space cache to be rebuilt through the call to:
> 
>     btrfs_cache_block_group(cache, 1);
> 
> That is because that only waits for the task (work queue job) that loads
> the space cache to change the cache state from BTRFS_CACHE_FAST to any
> other value. That is ok when the space cache on disk exists and is valid,
> but when the cache is not valid and needs to be rebuilt, it ends up
> returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
> at caching_thread()).
> 
> So this means that we can end up trying to unpin a range which is not yet
> marked as free in the block group. This results in the call to
> btrfs_remove_free_space() to return -EINVAL to
> btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
> as well as mounting the filesystem. More specifically the -EINVAL comes
> from free_space_cache.c:remove_from_bitmap(), because the requested range
> is not marked as free space (ones in the bitmap), we have the following
> condition triggered:
> 
> static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
> (...)
>        if (ret < 0 || search_start != *offset)
>             return -EINVAL;
> (...)
> 
> It's the "search_start != *offset" that results in the condition being
> evaluated to true.
> 
> When this happens we got the following in dmesg/syslog:
> 
> [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 
> 1 transid 1432 /dev/sdb scanned by mount (3816007)
> [72383.417837] BTRFS info (device sdb): disk space caching is enabled
> [72383.418536] BTRFS info (device sdb): has skinny extents
> [72383.423846] BTRFS info (device sdb): start tree-log replay
> [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong 
> amount of free space
> [72383.427686] BTRFS warning (device sdb): failed to load free space cache 
> for block group 30408704, rebuilding it now
> [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: 
> errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
> [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 
> unknown (Failed to recover log tree)
> [72383.460241] BTRFS error (device sdb): open_ctree failed
> 
> We also mark the range for the extent buffer in the excluded extents io
> tree. That is fine when the space cache is valid on disk and we can load
> it, in which case it causes no problems.
> 
> However, for the case where we need to rebuild the space cache, because it
> is either invalid or it is missing, having the extent buffer range marked
> in the excluded extents io tree leads to a -EINVAL failure from the call
> to btrfs_remove_free_space(), resulting in the log replay and mount to
> fail. This is because by having the range marked in the excluded extents
> io tree, the caching thread ends up never adding the range of the extent
> buffer as free space in the block group since the calls to
> add_new_free_space(), called from load_extent_tree_free(), filter out any
> ranges that are marked as excluded extents.
> 
> So fix this by making sure that during log replay we wait for the caching
> task to finish completely when we need to rebuild a space cache, and also
> drop the need to mark the extent buffer range in the excluded extents io
> tree, as well as clearing ranges from that tree at
> btrfs_finish_extent_commit().
> 
> This started to happen with some frequency on large filesystems having
> block groups with a lot of fragmentation since the recent commit
> e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in
> fact the issue has been there for years, it was just much less likely
> to happen.
> 
> Signed-off-by: Filipe Manana <[email protected]>

Added to misc-next, thanks. As this is more likely to happen on 5.11 due
to async loading I'll probably forward it to the next rc.

Reply via email to