On Fri, Feb 5, 2021 at 7:44 AM Anand Jain <anand.j...@oracle.com> wrote:
>
> On 2/4/2021 6:11 PM, Filipe Manana wrote:
> > On Thu, Feb 4, 2021 at 8:48 AM Anand Jain <anand.j...@oracle.com> wrote:
> >>
> >> On 2/3/2021 7:17 PM, fdman...@kernel.org wrote:
> >>> From: Filipe Manana <fdman...@suse.com>
> >>>
> >>> When we activate a swap file, at btrfs_swap_activate(), we acquire the
> >>> exclusive operation lock to prevent the physical location of the swap
> >>> file extents from being changed by operations such as balance and device
> >>> replace/resize/remove. There we also call can_nocow_extent() which,
> >>> among other things, checks if the block group of a swap file extent is
> >>> currently RO, and if it is we cannot use the extent, since a write
> >>> into it would result in COWing the extent.
> >>>
> >>> However we have no protection against a scrub operation running after we
> >>> activate the swap file, which can result in the swap file extents being
> >>> COWed while the scrub is running and operating on the respective block
> >>> group, because scrub turns a block group RO before it processes it
> >>> and then back to RW mode after processing it. That means an attempt
> >>> to write into a swap file extent while scrub is processing the respective
> >>> block group will result in COWing the extent, changing its physical
> >>> location on disk.
> >>>
> >>> Fix this by making sure that block groups that have extents used by
> >>> active swap files cannot be turned RO, therefore making it impossible
> >>> for a scrub to turn them into RO mode.
> >>
> >>> When a scrub finds a
> >>> block group that cannot be turned RO due to the existence of extents
> >>> used by swap files, it proceeds to the next block group and logs a warning
> >>> message mentioning that the block group was skipped due to active swap
> >>> files - this is the same approach we currently use for balance.
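
(For reference, the guard is conceptually simple. Below is a minimal
sketch, assuming a new swap_extents counter in struct btrfs_block_group
protected by the existing bg->lock spinlock; the names and the exact
error code are illustrative, not necessarily what the patch uses:

bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
{
	bool ret = true;

	spin_lock(&bg->lock);
	/* Refuse if the block group is already RO (scrub/balance own it). */
	if (bg->ro)
		ret = false;
	else
		bg->swap_extents++;
	spin_unlock(&bg->lock);

	return ret;
}

/* Symmetrically, the helper that turns a block group RO would fail with
 * something like -ETXTBSY while bg->swap_extents > 0, which is what lets
 * scrub skip the block group and log the warning. */
)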
> >>
> >>    IMO, it would be better to document this in the scrub man page.
> >>
> >>> This ends up removing the need to call btrfs_extent_readonly() from
> >>> can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
> >>> is RO through the new function btrfs_inc_block_group_swap_extents().
> >>>
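
(On the activation side, the per-extent check in btrfs_swap_activate()
would look roughly like the sketch below; lookup and cleanup details are
omitted, and bg_extent_count is an illustrative name for the count used
to undo this on swap file deactivation:

	bg = btrfs_lookup_block_group(fs_info, physical_block_start);
	if (!bg) {
		ret = -EINVAL;
		goto out;
	}
	if (!btrfs_inc_block_group_swap_extents(bg)) {
		btrfs_warn(fs_info,
			   "block group for swapfile extent is read-only");
		btrfs_put_block_group(bg);
		ret = -EINVAL;
		goto out;
	}
	bg_extent_count++;
)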
>
>
> >>> The only other caller of can_nocow_extent() is the direct IO write path,
>
> There is a third caller. check_can_nocow() also calls
> can_nocow_extent(), which I missed before. Any idea how the threads using
> check_can_nocow() get to know that the extent is RO? I have to back out
> the RB for this reason for now.

Yes, that one I missed. However, it's arguable how useful it is, because
starting nocow writers and changing a block group from RW to RO is not
atomic, and therefore sometimes it will have no effect, see below.

I'll leave that part out and perhaps deal with the direct IO write path
optimization later, as things are a bit entangled and it's not simple to
distinguish, at can_nocow_extent(), whether we are in the direct IO path
or not.

>
>
> >>> btrfs_get_blocks_direct_write(), but that already checks if a block group
> >>> is RO through the call to btrfs_inc_nocow_writers().
>
> >>> In fact, after this
> >>> change we end up optimizing the direct IO write path, since we no longer
> >>> iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
> >>> through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
> >>> This can save time and reduce contention on the lock that protects the
> >>> rbtree (especially because it is a spinlock and not a read/write lock) on
> >>> very large filesystems, with several thousand allocated block groups.
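
(To make the redundancy concrete: btrfs_inc_nocow_writers() already
refuses RO block groups, roughly along the lines of the sketch below,
simplified from my reading of block-group.c, so treat it as a sketch
rather than the exact code:

bool btrfs_inc_nocow_writers(struct btrfs_fs_info *fs_info, u64 bytenr)
{
	struct btrfs_block_group *bg;
	bool ret = true;

	bg = btrfs_lookup_block_group(fs_info, bytenr);
	if (!bg)
		return false;

	spin_lock(&bg->lock);
	if (bg->ro)
		ret = false;	/* caller falls back to COW */
	else
		atomic_inc(&bg->nocow_writers);
	spin_unlock(&bg->lock);

	/* On success the block group reference is kept and dropped
	 * later by btrfs_dec_nocow_writers(). */
	if (!ret)
		btrfs_put_block_group(bg);

	return ret;
}
)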
> >>>
> >>> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> >>> Signed-off-by: Filipe Manana <fdman...@suse.com>
> >>
> >>    I am not sure about the direct IO optimization part, but otherwise
> >>    the changes look good.
>
> Clarifying about the optimization part (for both buffered and direct IO):
> after patch 1 and patch 2, we now check for RO extents after the
> functions btrfs_cross_ref_exist() and csum_exist_in_range(), both of
> which do a search_slot, whereas before patch 1 and patch 2 we used to
> fail early (if the extent is RO) and avoid the search_slot, so I am not
> sure there is an optimization.

And?
Doing a single search is faster than doing 2 searches.
It does not matter whether we check if a block group is RO before or after
those checks, because:

1) Having a block group RO is a rather exceptional situation and, when it
happens (scrub and balance), it's temporary. We optimize for common cases;
we don't gain anything by checking for it twice.
Your concern goes the other way around: you want to do the RO check first
to fall back more quickly into cow mode, optimizing for the exceptional
and uncommon case. I could move up the call to btrfs_inc_nocow_writers(),
to take the place of the call to btrfs_extent_readonly(), but really that
is pointless since RO block groups are exceptional and temporary, and it
would make the code more complex than needed (having to track which gotos
require decrementing the nocow writers).

2) More importantly, and this is what really matters, have you thought
about what happens if the block group is turned RO right after we checked
that it was RW? Either after calling btrfs_extent_readonly() and before
calling btrfs_inc_nocow_writers(), or after calling both. Should we have
additional checks to see if the block group is now RO after each one of
those calls?

In case you didn't notice, starting a nocow write and setting a block
group RO is not atomic, and that is fine (it's actually much simpler than
making it atomic), because scrub and balance, after turning a block group
to RO mode, wait for any existing nocow writes to complete before they do
anything with the block group's extents. New writes are guaranteed not to
allocate from the block group or write to its existing extents, because
the block group is RO now.
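
(Schematically, the ordering that makes the non-atomicity safe looks like
the sketch below. This is simplified and not the exact scrub code; the
helpers named are the existing btrfs_inc_block_group_ro() and
btrfs_wait_nocow_writers(), but error handling and chunk allocation
details are omitted:

	/* Turn the block group RO first.  From this point on, new nocow
	 * writes see bg->ro and fall back to COW. */
	ret = btrfs_inc_block_group_ro(bg, false);
	if (ret)
		return ret;

	/* Wait for nocow writes that raced with the RW->RO transition
	 * and already incremented bg->nocow_writers. */
	btrfs_wait_nocow_writers(bg);

	/* Only now is it safe to operate on the block group's extents. */
)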

I hope this clarifies why having the RO block group check earlier or
later is irrelevant.

Thanks.

>
> Thanks, Anand
