On 08/08/2017 13:56, Kevin Wolf wrote:
> On 08.08.2017 at 13:04, Paolo Bonzini wrote:
>> On 08/08/2017 12:02, Kevin Wolf wrote:
>>> On 04.08.2017 at 13:46, Paolo Bonzini wrote:
>>>> On 04/08/2017 11:58, Stefan Hajnoczi wrote:
>>>>>> the root cause of this bug is related to this as well:
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02945.html
>>>>>>
>>>>>> From commit 99723548 we started assuming (incorrectly?) that blk_
>>>>>> functions always WILL have an attached BDS, but this is not always
>>>>>> true; for instance, flushing the cache from an empty CDROM.
>>>>>>
>>>>>> Paolo, can we move the in-flight counter increment outside of the
>>>>>> block-backend layer? Is that safe?
>>>>>
>>>>> I think the bdrv_inc_in_flight(blk_bs(blk)) needs to be fixed
>>>>> regardless of the throttling timer issue discussed below. BB cannot
>>>>> assume that the BDS graph is non-empty.
>>>>
>>>> Can we make bdrv_aio_* return NULL (even temporarily) if there is no
>>>> attached BDS? That would make it much easier to fix.
>>>
>>> Would the proper fix be much more complicated than the following? I
>>> must admit that I don't fully understand the current state of affairs
>>> with respect to threading, AioContext etc., so I may well be missing
>>> something.
>>
>> Not much, but it's not complete either. The issues I see are:
>> 1) blk_drain_all does not take the new counter into account;
>
> Ok, I think this does the trick:
>
>     void blk_drain_all(void)
>     {
>         BlockBackend *blk = NULL;
>
>         bdrv_drain_all_begin();
>         while ((blk = blk_all_next(blk)) != NULL) {
>             blk_drain(blk);
>         }
>         bdrv_drain_all_end();
>     }

Small note: blk_drain_all() is called while no AioContext has been
acquired, while blk_drain() requires you to acquire blk's AioContext.
And this should of course be split into blk_drain_all_begin and
blk_drain_all_end.
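Roughly something like this, perhaps (completely untested sketch, and it
assumes blk_drain() keeps the semantics of your patch, i.e. it only
waits for the BlockBackend's own requests):

    void blk_drain_all_begin(void)
    {
        BlockBackend *blk = NULL;

        bdrv_drain_all_begin();

        while ((blk = blk_all_next(blk)) != NULL) {
            AioContext *ctx = blk_get_aio_context(blk);

            /* blk_drain() expects the BB's AioContext to be held */
            aio_context_acquire(ctx);
            blk_drain(blk);
            aio_context_release(ctx);
        }
    }

    void blk_drain_all_end(void)
    {
        bdrv_drain_all_end();
    }

blk_drain_all() itself would then just be the two calls back to back.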
>> 2) bdrv_drain_all callers need to be audited to see if they should be
>> blk_drain_all (or more likely, only device BlockBackends should be
>> drained).
>
> qmp_transaction() is unclear to me. It should be changed in some way
> anyway because it uses bdrv_drain_all() rather than a begin/end pair.

I think every ->prepare should do drained_begin and ->clean should end
the drained section. Most of them are doing it already.

> do_vm_stop() and vm_stop_force_state() probably want blk_drain_all().

Or just all devices. From the October 2016 discussion, "qdev drive
properties would install a vmstate notifier to quiesce their own
BlockBackend rather than relying on a blind bdrv_drain_all from
vm_stop" (though there will surely be some non-qdevified straggler).

> xen_invalidate_map_cache() - wtf? Looks like the wrong layer to do
> this, but I guess blk_drain_all(), too.

It seems to be an advisory operation triggered by balloon deflation.
Whatever.

> block_migration_cleanup() is just lazy and really means a blk_drain()
> for its own BlockBackends. blk_drain_all() as the simple conversion.
>
> migration/savevm: Migration wants blk_drain_all() to get the devices
> quiesced. It already does vm_stop though.

It doesn't seem to be necessary at all.

> qemu-io: blk_drain_all(), too.

Or just blk_drain(qemuio_blk).

> Hm, looks like there won't be many callers of bdrv_drain_all() left. :-)

Not many blk_drain_all() callers either, and that's not a bad thing.
You missed bdrv_reopen_multiple. Maybe it also should not do anything,
and the caller should do begin/end instead.

>>> Note that my blk_drain() implementation doesn't necessarily drain
>>> blk_bs(blk) completely, but only those requests that came from the
>>> specific BlockBackend. I think this is what the callers want, but
>>> if otherwise, it shouldn't be hard to change.
>>
>> Yes, this should be what they want.
>
> Apparently not; block jobs don't complete with it any more. I haven't
> checked in detail, but it makes sense that they can have a BH (e.g. for
> block_job_defer_to_main_loop) without a request being in flight.

That BH should be handled here:

    while (!job->completed) {
        aio_poll(qemu_get_aio_context(), true);
    }

so that should be okay. (Block jobs have been a huge pain every time I
touched bdrv_drain, indeed...)

> So I'm including an unconditional bdrv_drain() again. Or I guess,
> calling aio_poll() unconditionally and including its return value in
> the loop condition would be the cleaner approach?

Note that if you have no block device, then you don't have an AioContext
for aio_poll() to use. In fact, we're really close to removing the
AioContext dependency for submitting I/O requests (AioContext remains
only as a place for network drivers to register their sockets' I/O
handlers), so you shouldn't use it from block-backend.c.

That's why I suggested returning NULL from AIO operations if there is no
blk_bs(blk). It may be a little less tidy (I'm more worried about static
analysis lossage than anything else), but it does make some sense. Most
callers are checking blk_is_available or blk_is_inserted already, which
explains why we're not seeing more crashes in block-backend.c.
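To make that concrete, the kind of guard I mean would sit at the top of
each blk_aio_* wrapper, e.g. for the empty-CD-ROM flush case (untested
sketch; and instead of a literal NULL, reusing blk_abort_aio_request()
to complete the request with -ENOMEDIUM from a bottom half is probably
nicer for the callers):

    BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                              BlockCompletionFunc *cb, void *opaque)
    {
        if (!blk_bs(blk)) {
            /* Nothing attached: never reach bdrv_inc_in_flight(NULL).
             * Either return NULL here, or fail the request cleanly:
             */
            return blk_abort_aio_request(blk, cb, opaque, -ENOMEDIUM);
        }

        /* ... the existing implementation, unchanged ... */
    }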
If you can set aside the blk_bs(blk) == NULL case, you're basically
reinventing BDRV_POLL_WHILE at this point. If you add a bdrv_wakeup to
blk_dec_in_flight and use BDRV_POLL_WHILE in blk_drain, then things
should in theory just work (though most likely they won't, for some
reason that takes hours to debug).

Paolo
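P.S. For concreteness, here is roughly what I mean, completely untested
and assuming your patch's blk->in_flight counter and blk_inc/dec_in_flight
helpers:

    static void blk_dec_in_flight(BlockBackend *blk)
    {
        atomic_dec(&blk->in_flight);
        /* kick a BDRV_POLL_WHILE that may be waiting in blk_drain() */
        bdrv_wakeup(blk_bs(blk));
    }

    void blk_drain(BlockBackend *blk)
    {
        BlockDriverState *bs = blk_bs(blk);

        if (bs) {
            bdrv_drained_begin(bs);
        }

        /* modulo the blk_bs(blk) == NULL case discussed above */
        BDRV_POLL_WHILE(bs, atomic_read(&blk->in_flight) > 0);

        if (bs) {
            bdrv_drained_end(bs);
        }
    }

(bdrv_wakeup() would of course need the same special-casing for
blk_bs(blk) == NULL.)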