On Wed, Apr 12, 2023 at 1:54 PM Hanna Czenczek <hre...@redhat.com> wrote: > On 05.04.23 18:31, Paolo Bonzini wrote: > > + if (busy || blk->in_flight) { > > + return true; > > + } > > + > > + if (qatomic_read(&blk->request_queuing) == BLK_QUEUE_READY) { > > + qatomic_set(&blk->request_queuing, BLK_QUEUE_QUIESCENT); > > + } > > + return false; > > } > > This implicitly relies on nobody increasing blk->in_flight (or > dev_ops->drained_poll() returning `true` again) while the BB is starting > to be drained, because if the function were to be called again after it > has returned `false` once per drained section (not sure if that’s > possible![1]), then we’d end up in the original situation, with > in_flight elevated and queuing enabled.
Yes, it does. > Is that really strictly guaranteed somehow or is it rather a complex > conglomerate of many cases that in the end happen to work out > individually? I mean, I could imagine that running > BlockDevOps.drained_begin() is supposed to guarantee that, but it can’t, > because only NBD seems to implement it. The commit message talks about > IDE being fine (by accident?) because it needs BQL availability to > submit new requests. But that’s very complex and I’d rather have a > strict requirement to guarantee correctness. It's a conglomerate of three cases each of which is sufficient (BQL, aio_disable_external, bdrv_drained_begin---plus just not using blk_inc_in_flight could be a fourth, of course). Of these, aio_disable_external() is going away in favor of the .bdrv_drained_begin callback; and blk_inc_in_flight() is used rarely in the first place so I thought it'd be not too hard to have this requirement. > [1] If the blk_root_drained_poll() isn’t called anymore after returning > `false`, all will be good, but I assume it will be, because we have a > quiesce_counter, not a quiesce_bool. We could kind of emulate this by > continuing to return `false` after blk_root_drained_poll() has returned > `false` once, until the quiesce_counter becomes 0. > We could also have blk_root_drained_poll(), if it sees in_flight > 0 && > request_queuing == BLK_QUEUE_QUIESCENT, revert request_queuing to > BLK_QUEUE_READY and resume all queued requests. The intended long term fix is to remove request queuing and, if a request is submitted while BLK_QUEUE_QUIESCENT, give an assertion failure. But since the hang requires blk_inc_in_flight() in the device, perhaps in the short term documenting it in blk_inc_in_flight() may be enough? Paolo