On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> 
> Note that even if bdrv_drained_begin() would already be marked as

"if ... were already marked"

> GRAPH_UNLOCKED, TSA would not complain about the instance in
> bdrv_change_aio_context() before this change, because it is preceded
> by a bdrv_graph_rdunlock_main_loop() call. It is not correct to
> release the lock here, and in case the caller holds a write lock, it
> wouldn't actually release the lock.
> 
> In combination with block-stream, there is a deadlock that can happen
> because of this [0]. In particular, it can happen that
> main thread              IO thread
> 1. acquires write lock
>                          in blk_co_do_preadv_part():
>                          2. have non-zero blk->in_flight
>                          3. try to acquire read lock
> 4. begin drain
> 
> Steps 3 and 4 might be switched. Draining will poll and get stuck,
> because it will see the non-zero in_flight counter. But the IO thread
> will not make any progress either, because it cannot acquire the read
> lock.
> 
> After this change, all paths to bdrv_change_aio_context() drain:
> bdrv_change_aio_context() is called by:
> 1. bdrv_child_cb_change_aio_ctx() which is only called via the
>    change_aio_ctx() callback, see below.
> 2. bdrv_child_change_aio_context(), see below.
> 3. bdrv_try_change_aio_context(), where a drained section is
>    introduced.
> 
> The change_aio_ctx() callback is called by:
> 1. bdrv_attach_child_common_abort(), where a drained section is
>    introduced.
> 2. bdrv_attach_child_common(), where a drained section is introduced.
> 3. bdrv_parent_change_aio_context(), see below.
> 
> bdrv_child_change_aio_context() is called by:
> 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
>    section is invariant.
> 2. child_job_change_aio_ctx(), which is only called via the
>    change_aio_ctx() callback, see above.
> 
> bdrv_parent_change_aio_context() is called by:
> 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
>    section is invariant.
> 
> This resolves all code paths. Note that bdrv_attach_child_common()
> and bdrv_attach_child_common_abort() hold the graph write lock and
> callers of bdrv_try_change_aio_context() might too, so they are not
> actually allowed to drain either. This will be addressed in the
> following commits.
> 
> More granular draining is not trivially possible, because
> bdrv_change_aio_context() can recursively call itself e.g. via
> bdrv_child_change_aio_context().
> 
> [0]: 
> https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63...@virtuozzo.com/
> 
> Reported-by: Andrey Drobyshev <andrey.drobys...@virtuozzo.com>
> Signed-off-by: Fiona Ebner <f.eb...@proxmox.com>
> ---
> 
> Changes in v2:
> * Split up into smaller pieces, flesh out commit messages.
> 
>  block.c | 27 ++++++++++++++-------------
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 01144c895e..7148618504 100644
> --- a/block.c
> +++ b/block.c
> @@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
>  
>  static bool bdrv_backing_overridden(BlockDriverState *bs);
>  
> -static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> -                                    GHashTable *visited, Transaction *tran,
> -                                    Error **errp);
> +static bool GRAPH_RDLOCK
> +bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> +                        GHashTable *visited, Transaction *tran, Error **errp);

For static functions, we should have the GRAPH_RDLOCK annotation both
here and in the actual definition, too.

>  /* If non-zero, use only whitelisted block drivers */
>  static int use_bdrv_whitelist;
> @@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
>  
>          /* No need to visit `child`, because it has been detached already */
>          visited = g_hash_table_new(NULL, NULL);
> +        bdrv_drain_all_begin();
>          ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
>                                                visited, tran, &error_abort);
> +        bdrv_drain_all_end();
>          g_hash_table_destroy(visited);
>  
>          /* transaction is supposed to always succeed */
> @@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
>              bool ret_child;
>  
>              g_hash_table_add(visited, new_child);
> +            bdrv_drain_all_begin();
>              ret_child = child_class->change_aio_ctx(new_child, child_ctx,
>                                                      visited, aio_ctx_tran,
>                                                      NULL);
> +            bdrv_drain_all_end();
>              if (ret_child == true) {
>                  error_free(local_err);
>                  ret = 0;

Should we document in the header file that BdrvChildClass.change_aio_ctx
is called with the node drained?

We could add assertions to bdrv_child/parent_change_aio_context or at
least comments to this effect. (Assertions might be over the top because
it's easy to verify that both are only called from
bdrv_change_aio_context().)

> @@ -7619,10 +7623,6 @@ bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
>  static void bdrv_set_aio_context_clean(void *opaque)
>  {
>      BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
> -    BlockDriverState *bs = (BlockDriverState *) state->bs;
> -
> -    /* Paired with bdrv_drained_begin in bdrv_change_aio_context() */
> -    bdrv_drained_end(bs);
>  
>      g_free(state);
>  }
> @@ -7650,6 +7650,8 @@ static TransactionActionDrv set_aio_context = {
>   *
>   * @visited will accumulate all visited BdrvChild objects. The caller is
>   * responsible for freeing the list afterwards.
> + *
> + * @bs must be drained.
>   */
>  static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>                                      GHashTable *visited, Transaction *tran,
> @@ -7664,21 +7666,17 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>          return true;
>      }
>  
> -    bdrv_graph_rdlock_main_loop();
>      QLIST_FOREACH(c, &bs->parents, next_parent) {
>          if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
> -            bdrv_graph_rdunlock_main_loop();
>              return false;
>          }
>      }
>  
>      QLIST_FOREACH(c, &bs->children, next) {
>          if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
> -            bdrv_graph_rdunlock_main_loop();
>              return false;
>          }
>      }
> -    bdrv_graph_rdunlock_main_loop();
>  
>      state = g_new(BdrvStateSetAioContext, 1);
>      *state = (BdrvStateSetAioContext) {
> @@ -7686,8 +7684,7 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>          .bs = bs,
>      };
>  
> -    /* Paired with bdrv_drained_end in bdrv_set_aio_context_clean() */
> -    bdrv_drained_begin(bs);
> +    assert(bs->quiesce_counter > 0);
>  
>      tran_add(tran, &set_aio_context, state);
>  
> @@ -7720,7 +7717,11 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>      if (ignore_child) {
>          g_hash_table_add(visited, ignore_child);
>      }
> +    bdrv_drain_all_begin();
> +    bdrv_graph_rdlock_main_loop();
>      ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
> +    bdrv_graph_rdunlock_main_loop();
> +    bdrv_drain_all_end();
>      g_hash_table_destroy(visited);

I think you're ending the drained section too early here. Previously,
the nodes were kept drained until after tran_abort/commit(), and I think
that's important (tran_commit() is the thing that actually switches the
AioContext).

Kevin

