Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: > So it looks like main thread is processing job-dismiss request and is > holding write lock taken in block_job_remove_all_bdrv() (frame #20 > above). At the same time iothread spawns a coroutine which performs IO > request. Before the coroutine is spawned, blk_aio_prwv() increases > 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is > trying to acquire the read lock. But main thread isn't releasing the > lock as blk_root_drained_poll() returns true since blk->in_flight > 0. > Here's the deadlock.
And for the IO test you provided, it's client->nb_requests that behaves similarly to blk->in_flight here. The issue also reproduces easily when issuing the following QMP command in a loop while doing IO on a device: > void qmp_block_locked_drain(const char *node_name, Error **errp) > { > BlockDriverState *bs; > > bs = bdrv_find_node(node_name); > if (!bs) { > error_setg(errp, "node not found"); > return; > } > > bdrv_graph_wrlock(); > bdrv_drained_begin(bs); > bdrv_drained_end(bs); > bdrv_graph_wrunlock(); > } It seems like either it would be necessary to require: 1. not draining inside an exclusively locked section or 2. making sure that variables used by drained_poll routines are only set while holding the reader lock ? Those seem to require rather involved changes, so a third option might be to make draining inside an exclusively locked section possible, by embedding such locked sections in a drained section: > diff --git a/blockjob.c b/blockjob.c > index 32007f31a9..9b2f3b3ea9 100644 > --- a/blockjob.c > +++ b/blockjob.c > @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) > * one to make sure that such a concurrent access does not attempt > * to process an already freed BdrvChild. > */ > + bdrv_drain_all_begin(); > bdrv_graph_wrlock(); > while (job->nodes) { > GSList *l = job->nodes; > @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) > g_slist_free_1(l); > } > bdrv_graph_wrunlock(); > + bdrv_drain_all_end(); > } > > bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) This seems to fix the issue at hand. I can send a patch if this is considered an acceptable approach. Best Regards, Fiona