On 31.07.23 at 09:35, Juan Quintela wrote:
> Fiona Ebner <f.eb...@proxmox.com> wrote:
>> The bdrv_create_dirty_bitmap() function (which is also called by
>> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
>> a wrapper around a coroutine, and when not called in coroutine context
>> would use bdrv_poll_co(). Such a call would trigger an assert() if the
>> correct AioContext hasn't been acquired before, because polling would
>> try to release the AioContext.
>
> The ingenuous part of me thinks:
>
> If the problem is that bdrv_poll_co() releases an AioContext that it
> hasn't acquired, perhaps we should fix bdrv_poll_co().

AFAIU, it's necessary to release the lock there, so somebody else can
make progress while we poll.
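To illustrate what I mean (just a sketch of my understanding of the
current locking rules, not code from the patch; bs and len stand in for
the caller's variables): a non-coroutine caller has to hold the drive's
AioContext exactly once around the polling call, precisely so that the
poll can drop and re-take it:

    /* Sketch only: the generated wrapper polls when we are not in
     * coroutine context, and that poll releases and re-acquires the
     * context, so the caller must hold it (exactly once) beforehand. */
    AioContext *ctx = bdrv_get_aio_context(bs);
    int64_t len;

    aio_context_acquire(ctx);
    len = bdrv_getlength(bs); /* polls bdrv_co_getlength() internally */
    aio_context_release(ctx);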
>> The issue would happen for snapshots,

Sorry, what I wrote is not true, because bdrv_co_load_vmstate() is a
coroutine, so I think it would be the same situation as for migration
and bdrv_co_getlength() would be called directly by the wrapper.

And the patch itself also has an issue. AFAICT, when
qcow2_co_load_vmstate() is called, we have already acquired the context
for the drive we load the snapshot from, and since
polling/AIO_WAIT_WHILE() requires that the caller has acquired the
context exactly once, we'd need to distinguish whether the dirty bitmap
is for that drive (we can't acquire the context a second time) or for a
different drive (we need to acquire the context for the first time).
See the sketch at the end of this mail for what I mean.

>> but won't in practice, because
>> saving a snapshot with a block dirty bitmap is currently not possible.
>> The reason is that dirty_bitmap_save_iterate() returns whether it has
>> completed the bulk phase, which only happens in postcopy, so
>> qemu_savevm_state_iterate() will always return 0, meaning the call
>> to iterate will be repeated over and over again without ever reaching
>> the completion phase.
>>
>> Still, this would make the code more robust for the future.
>
> What I wonder is if we should annotate these calls somehow to say that
> they need to be called from coroutine context, because I would never
> have found it.
>
> And just thinking out loud:
>
> I still wonder if we should make incoming migration its own thread.
> Because we get more and more problems because it is a coroutine, which
> in the non-multifd case can consume a whole CPU alone, so it makes no
> sense to have a coroutine.
>

That would then be a situation where the issue would pop up (assuming
the new thread doesn't also use a coroutine to load the state).

> On the other hand, with multifd, it almost doesn't use any resources, so ....
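For completeness, here is roughly what I mean above by distinguishing
the two cases (again only a sketch with made-up variable names like
bitmap_bs and vmstate_bs, not a proposed patch):

    /* Only take the bitmap drive's context if it is not the one we
     * already hold for the drive the vmstate is loaded from, because
     * polling requires holding the context exactly once. */
    AioContext *bitmap_ctx = bdrv_get_aio_context(bitmap_bs);
    bool need_lock = bitmap_ctx != bdrv_get_aio_context(vmstate_bs);

    if (need_lock) {
        aio_context_acquire(bitmap_ctx);
    }
    bdrv_create_dirty_bitmap(bitmap_bs, granularity, name, errp);
    if (need_lock) {
        aio_context_release(bitmap_ctx);
    }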