Am 28.04.23 um 18:54 schrieb Juan Quintela: > Kevin Wolf <kw...@redhat.com> wrote: >> Am 28.04.2023 um 10:38 hat Juan Quintela geschrieben: >>> Kevin Wolf <kw...@redhat.com> wrote: >>>>> I am perhaps a bit ingenuous here, but it is there a way to convince >>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread? >>>> >>>> I believe we're talking about a technicality here. I asked another more >>>> fundamental question that nobody has answered yet: >>>> >>>> Why do you think that it's ok to call bdrv_writev_vmstate() without >>>> holding the BQL? >>> >>> I will say this function starts by bdrv_ (i.e. block layer people) and >>> endes with _vmstate (i.e. migration people). >>> >>> To be honest, I don't know. That is why I _supposed_ you have an idea. >> >> My idea is that bdrv_*() can only be called when you hold the BQL, or >> for BlockDriverStates in an iothread the AioContext lock. >> >> Apparently dropping the BQL in migration code was introduced in Paolo's >> commit 9b095037527. > > Damn. I reviewed it, so I am as guilty as the author. > 10 years later without problems I will not blame that patch. > > I guess we changed something else that broke doing it without the lock. > > But no, I still don't have suggestions/ideas. >
I do feel like the issue might be very difficult to trigger under normal circumstances. Depending on the configuration and what you do in the guest, aio_poll in a vCPU thread does not happen often and I imagine snapshot-save is also not a super frequent operation for most people. It still takes me a while to trigger the issue by issuing lots of pflash writes and running snapshot-save in a loop, I'd guess about 30-60 snapshots. Another reason might be that generated co-wrappers were less common in the past? >> I'm not sure what this was supposed to improve in >> the case of snapshots because the VM is stopped anyway. Is it? Quoting Juan:> d- snapshots are a completely different beast, that don't really stop > the guest in the same way at that point, and sometimes it shows in > this subtle details. >> Would anything bad happen if we removed the BQL unlock/lock section in >> qemu_savevm_state() again? > > Dunno. > > For what is worth, I can say that it survives migration-test, but don't > ask me why/how/... > > Fiona, can you check if it fixes your troubles? > Just removing the single section in qemu_savevm_state() breaks even the case where snapshot_save_job_bh() is executed in the main thread, because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl() which asserts that it's not already locked. Also removing the lock/unlock pair in ram_init_bitmaps() seems to work. But I'm not sure what else a full semantic revert of commit 9b095037527 would entail. Best Regards, Fiona