On Thu, Apr 18, 2024 at 10:46:21AM -0400, Brian Foster wrote:
> So Kent had pointed out in an earlier chat that this needs to be a
> little more robust wrt to transaction restart because there is the
> potential for spinning under folio lock contention. I think there are a
> couple ways to do this that don't necessarily require reworking a bunch
> more of this code. One is to simply run a serialization sequence where
> we drop trans locks, run a blocking pagecache scan, and then restart
> this particular iteration of the loop to account for the fact that the
> extent btree may have changed since the btree locks were dropped.
The problem there was using bch2_trans_unlock() and relock() (basically
you were doing an open coded drop_locks_do()) without first attempting a
nonblocking version of the thing in the do.
drop_lock_do() has to be a slowpath thing; if it's something you always
do in a given transaction it causes livelocks.
the "try nonblocking; then drop_locks_do()" pattern works well because
for most use cases (allocating memory, taking a lock), if we end up
calling trans_unlock(), doing the blocking do, and then take a
transaction restart, the success of the blocking do means the
nonblocking do will likely succeed on the next iteration.
> When thinking about that, it occurred to me that this is roughly
> equivalent to the TOCTOU nature of fiemap in general, where we may or
> may not have caught a delalloc extent before or after it was written
> back. Therefore, it seems reasonable to just do the lock cycle, grab the
> range of folios covering what was previously identified as a hole
> (before dropping locks), and then just report it as delalloc as we would
> have had we not dropped locks at all. The code for that seems less
> kludgy and follows the allocate_dropping_locks() pattern.
>
> So this is the relevant bit of code I'm working with currently:
>
> /*
> * If a hole exists before the start of the extent key, scan
> the
> * range for pagecache data that might be pending writeback
> and
> * thus not yet exist in the extent tree.
> *
> * Since we hold extent btree locks here, we must trylock the
> * folios to avoid deadlocks with the write path. On trylock
> * failure, redo the scan with blocking locks in proper lock
> * order.
> *
> * Technically this means that the extent btree could have
> been
> * updated from folio writeback, but this is a TOCTOU race and
> * so we just report the delalloc extent based on the state of
> * the iter prior to the unlock.
> */
> if (iter.pos.offset > start) {
> ret = bch2_fiemap_hole(vinode, start << 9,
> iter.pos.offset << 9, &cur,
> true);
> if (ret == -EAGAIN) {
> ret = drop_locks_do(trans,
> bch2_fiemap_hole(vinode, start << 9,
> iter.pos.offset << 9,
> &cur, false));
> }
> if (!ret)
> have_delalloc = true;
> else if (ret != -ENOENT)
> break;
> }
>
> Note that the last param to bch2_fiemap_hole() is the nonblock param
> lifted up from bch2_seek_pagecache_[data|hole](). I also haven't done
> anything with -EAGAIN because it originates from the latter and is more
> isolated than it was previously (actually I'm thinking about pushing
> trans and the whole drop locks sequence down into bch2_fiemap_hole() if
> this otherwise seems reasonable). Thoughts?
this will give us tight synchronization between the extents btree and
the pagecache - it's as if we're locking both at the same time, no need
to think about TOCTOU issues - nice.