On Thu, Apr 18, 2024 at 10:46:21AM -0400, Brian Foster wrote:
> So Kent had pointed out in an earlier chat that this needs to be a
> little more robust wrt to transaction restart because there is the
> potential for spinning under folio lock contention. I think there are a
> couple ways to do this that don't necessarily require reworking a bunch
> more of this code. One is to simply run a serialization sequence where
> we drop trans locks, run a blocking pagecache scan, and then restart
> this particular iteration of the loop to account for the fact that the
> extent btree may have changed since the btree locks were dropped.

The problem there was using bch2_trans_unlock() and relock() (basically
you were doing an open coded drop_locks_do()) without first attempting a
nonblocking version of the thing in the do.

drop_lock_do() has to be a slowpath thing; if it's something you always
do in a given transaction it causes livelocks.

the "try nonblocking; then drop_locks_do()" pattern works well because
for most use cases (allocating memory, taking a lock), if we end up
calling trans_unlock(), doing the blocking do, and then take a
transaction restart, the success of the blocking do means the
nonblocking do will likely succeed on the next iteration.

> When thinking about that, it occurred to me that this is roughly
> equivalent to the TOCTOU nature of fiemap in general, where we may or
> may not have caught a delalloc extent before or after it was written
> back. Therefore, it seems reasonable to just do the lock cycle, grab the
> range of folios covering what was previously identified as a hole
> (before dropping locks), and then just report it as delalloc as we would
> have had we not dropped locks at all. The code for that seems less
> kludgy and follows the allocate_dropping_locks() pattern.
> 
> So this is the relevant bit of code I'm working with currently:
> 
>                 /*
>                  * If a hole exists before the start of the extent key, scan 
> the
>                  * range for pagecache data that might be pending writeback 
> and
>                  * thus not yet exist in the extent tree.
>                  *
>                  * Since we hold extent btree locks here, we must trylock the
>                  * folios to avoid deadlocks with the write path. On trylock
>                  * failure, redo the scan with blocking locks in proper lock
>                  * order.
>                  *
>                  * Technically this means that the extent btree could have 
> been
>                  * updated from folio writeback, but this is a TOCTOU race and
>                  * so we just report the delalloc extent based on the state of
>                  * the iter prior to the unlock.
>                  */
>                 if (iter.pos.offset > start) {
>                         ret = bch2_fiemap_hole(vinode, start << 9,
>                                                iter.pos.offset << 9, &cur, 
> true);
>                         if (ret == -EAGAIN) {
>                                 ret = drop_locks_do(trans,
>                                         bch2_fiemap_hole(vinode, start << 9,
>                                                          iter.pos.offset << 9,
>                                                          &cur, false));
>                         }
>                         if (!ret)
>                                 have_delalloc = true;
>                         else if (ret != -ENOENT)
>                                 break;
>                 }
> 
> Note that the last param to bch2_fiemap_hole() is the nonblock param
> lifted up from bch2_seek_pagecache_[data|hole](). I also haven't done
> anything with -EAGAIN because it originates from the latter and is more
> isolated than it was previously (actually I'm thinking about pushing
> trans and the whole drop locks sequence down into bch2_fiemap_hole() if
> this otherwise seems reasonable). Thoughts?

this will give us tight synchronization between the extents btree and
the pagecache - it's as if we're locking both at the same time, no need
to think about TOCTOU issues - nice.

Reply via email to