On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote: > > In the meantime, we start processing the blk_drain() code, so by the time > > this > > blk_pread() actually gets handled, quiesce is set, and we get stuck in the > > blk_wait_while_drained(). > > > > I don't know the qemu block stack well enough to propose an actual fix. > > > > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce > > in > > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty > > sure > > that's just a band-aid instead of fixing the deadlock. > > Related discussion: > https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html > > Actually, it seems we never fixed that problem either? > > Back then I suggested that blk_drain*() should disable request queuing > because its callers don't want to quiesce the BlockBackend, but rather > get their own requests completed. I think this approach would solve this > one as well.
In this case though, it's not their own requests right? We have incoming I/O from the guest and the eject is a separate operation. So why it would be OK to disable queuing and have ongoing I/Os (i.e. blk->in_flight > 0) racing against the block remove? > Your experiment with moving the queuing later is basically what Paolo I think it's much more flaky than that, isn't it? It's just clearing out the *current* pending queue, but nothing would stop new I/Os from being started before we dropped into the poll for blk->in_flight to be zero again. In my case, this just happens to work because the prior tray open notification has stopped new I/O being filed, right? thanks john