On Sat, Nov 08, 2025 at 04:59:31PM -0600, Eric Blake wrote: > The user "John Doe" reported a deadlock when attempting to use > qemu-storage-daemon to serve both a base file over NBD, and a qcow2 > file with that NBD export as its backing file, from the same process, > even though it worked just fine when there were two q-s-d processes. > The bulk of the NBD server code properly uses coroutines to make > progress in an event-driven manner, but the code for spawning a new > coroutine at the point when listen(2) detects a new client was > hard-coded to use the global GMainContext; in other words, the > callback that triggers nbd_client_new to let the server start the > negotiation sequence with the client requires the main loop to be > making progress. However, the code for bdrv_open of a qcow2 image > with an NBD backing file uses an AIO_WAIT_WHILE nested event loop to > ensure that the entire qcow2 backing chain is either fully loaded or > rejected, without any side effects from the main loop causing unwanted > changes to the disk being loaded (in short, an AioContext represents > the set of actions that are known to be safe while handling block > layer I/O, while excluding any other pending actions in the global > main loop with potentially larger risk of unwanted side effects). > > This creates a classic case of deadlock: the server can't progress to > the point of accept(2)ing the client to write to the NBD socket > because the main loop is being starved until the AIO_WAIT_WHILE > completes the bdrv_open, but the AIO_WAIT_WHILE can't progress because > it is blocked on the client coroutine stuck in a read() of the > expected magic number from the server side of the socket. > > This patch adds a new API to allow clients to opt in to listening via > an AioContext rather than a GMainContext. This will allow NBD to fix > the deadlock by performing all actions during bdrv_open in the main > loop AioContext. > > An upcoming patch will then add a unit test (kept separate to make it > easier to rearrange the series to demonstrate the deadlock without > this patch). > > Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169 > Signed-off-by: Eric Blake <[email protected]> > > --- > v2: Retitle and add new API rather than changing semantics of > existing qio_net_listener_set_client_func; use qio accessor rather > than direct access to the sioc fd and a lower-level aio call > --- > include/io/net-listener.h | 16 ++++++++++++++++ > io/net-listener.c | 36 +++++++++++++++++++++++++++++++++--- > 2 files changed, 49 insertions(+), 3 deletions(-)
Reviewed-by: Daniel P. Berrangé <[email protected]> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
