On 11.06.24 at 16:04, Stefan Hajnoczi wrote:
> On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
>> On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
>>> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
>>>> The fact that snapshot_save_job_bh() is scheduled in the main
>>>> loop's qemu_aio_context AioContext means that it might get executed
>>>> during a vCPU thread's aio_poll(). But saving the VM state cannot
>>>> happen while the guest or devices are active, and attempting it can
>>>> lead to assertion failures. See issue #2111 for two examples. Avoid
>>>> the problem by scheduling snapshot_save_job_bh() in the iohandler
>>>> AioContext, which is not polled by vCPU threads.
>>>>
>>>> Solves Issue #2111.
>>>>
>>>> This change also solves the following issue:
>>>>
>>>> Since commit effd60c878 ("monitor: only run coroutine commands in
>>>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
>>>> right after starting the job anymore, but only after the job
>>>> finished, which can take a long time. The reason is that, after
>>>> commit effd60c878, do_qmp_dispatch_bh() runs in the iohandler
>>>> AioContext. When do_qmp_dispatch_bh() wakes the qmp_dispatch()
>>>> coroutine, the coroutine cannot be entered immediately anymore, but
>>>> needs to be scheduled to the main loop's qemu_aio_context
>>>> AioContext. But snapshot_save_job_bh() is scheduled first to the
>>>> same AioContext and thus gets executed first.
>>>>
>>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
>>>> Signed-off-by: Fiona Ebner <f.eb...@proxmox.com>
>>>> ---
>>>>
>>>> While initial smoke testing seems fine, I'm not familiar enough
>>>> with this to rule out any pitfalls with the approach. Any reason
>>>> why scheduling to the iohandler AioContext could be wrong here?
>>>
>>> If something waits for a BlockJob to finish using aio_poll() from
>>> qemu_aio_context, then a deadlock is possible, since the
>>> iohandler_ctx won't get a chance to execute. The only suspicious
>>> code path I found was job_completed_txn_abort_locked() ->
>>> job_finish_sync_locked(), but I'm not sure whether it triggers this
>>> scenario. Please check that code path.
>>>
>>
>> Sorry, I don't understand. Isn't executing the scheduled BH the only
>> additional progress that the iohandler_ctx needs to make compared to
>> before the patch? How exactly would that cause issues when waiting
>> for a BlockJob?
>>
>> Or do you mean something waiting for the SnapshotJob from
>> qemu_aio_context before snapshot_save_job_bh had the chance to run?
>
> Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx
> has no chance to execute. But I haven't audited the code to
> understand whether this can happen.
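For illustration, the starvation pattern described above can be
reproduced outside of QEMU with two plain GLib main contexts. This is
only a sketch, not QEMU code; main_ctx and handler_ctx are hypothetical
stand-ins for qemu_aio_context and iohandler_ctx:

#include <glib.h>
#include <stdio.h>

/* The "BH": sets a completion flag, then removes itself. */
static gboolean bh_cb(gpointer data)
{
    *(gboolean *)data = TRUE;
    return G_SOURCE_REMOVE;
}

int main(void)
{
    /* Hypothetical stand-ins for qemu_aio_context and iohandler_ctx. */
    GMainContext *main_ctx = g_main_context_new();
    GMainContext *handler_ctx = g_main_context_new();
    gboolean done = FALSE;

    /* Schedule the "BH" in handler_ctx, analogous to
     * aio_bh_schedule_oneshot(iohandler_get_aio_context(), ...). */
    GSource *src = g_idle_source_new();
    g_source_set_callback(src, bh_cb, &done, NULL);
    g_source_attach(src, handler_ctx);
    g_source_unref(src);

    /* Wait loop in the shape of AIO_WAIT_WHILE_UNLOCKED(): it only
     * ever iterates main_ctx, so nothing dispatches handler_ctx,
     * 'done' never becomes TRUE, and this blocks forever. */
    while (!done) {
        g_main_context_iteration(main_ctx, TRUE);
    }

    printf("completed\n"); /* never reached */
    return 0;
}

The loop hangs because the only progress that could terminate it would
have to come from a context it never polls, which is exactly the
relationship between a wait loop polling qemu_aio_context and a BH
sitting in iohandler_ctx.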
So job_finish_sync_locked() is executed in
job_completed_txn_abort_locked() when the following branch is taken

> if (!job_is_completed_locked(other_job))

and there is no other job in the transaction, so we can assume
other_job is the snapshot-save job itself.

The callers of job_completed_txn_abort_locked() are:

1. in job_do_finalize_locked() if job->ret is non-zero. Its callers
are:

1a. in job_finalize_locked() if JOB_VERB_FINALIZE is allowed, meaning
job->status is JOB_STATUS_PENDING, so job_is_completed_locked() will
be true.

1b. in job_completed_txn_success_locked(), which sets job->status to
JOB_STATUS_WAITING beforehand, so job_is_completed_locked() will be
true.

2. in job_completed_locked(), where the call is only made if job->ret
is non-zero, in which case job->status was set to JOB_STATUS_ABORTING
by the preceding job_update_rc_locked(), and thus
job_is_completed_locked() will be true.

3. in job_cancel_locked() if job->deferred_to_main_loop is true, which
is set in job_co_entry() before job_exit() is scheduled as a BH, and
is also set in job_do_dismiss_locked(). In the former case,
snapshot_save_job_bh() has already been executed. In the latter case,
job_is_completed_locked() will be true (since job_early_fail() is not
used for the snapshot job).

However, job_finish_sync_locked() is also executed via
job_cancel_sync_all() during qemu_cleanup(). With bad timing there, I
suppose the issue could arise. In fact, I could trigger it with the
following hack on top:

> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0086b76ab0..42c93176ba 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3429,6 +3429,17 @@ static void snapshot_load_job_bh(void *opaque)
>
>  static void snapshot_save_job_bh(void *opaque)
>  {
> +    static int n = 0;
> +    n++;
> +    if (n < 10000000) {
> +        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
> +                                snapshot_save_job_bh, opaque);
> +        if (n % 1000000 == 0) {
> +            error_report("iteration %d", n);
> +        }
> +        return;
> +    }
> +
>      Job *job = opaque;
>      SnapshotJob *s = container_of(job, SnapshotJob, common);

Once the AIO_WAIT_WHILE_UNLOCKED(job->aio_context, ...) in
job_finish_sync_locked() was hit, snapshot_save_job_bh() was never
executed again, leading to the deadlock you described.

Best Regards,
Fiona