On Tue, Dec 23, 2025 at 12:36:19PM -0300, Fabiano Rosas wrote:
> Peter Xu <[email protected]> writes:
> 
> > On Mon, Dec 22, 2025 at 05:18:22PM +0530, Prasad Pandit wrote:
> >> From: Prasad Pandit <[email protected]>
> >>
> >> When the migration connection is broken, the QEMU and libvirtd(8)
> >> processes on the source side receive a TCP connection reset
> >> notification. QEMU sets the migration status to FAILED and
> >> proceeds to migration_cleanup(). Meanwhile, libvirtd(8) sends
> >> a QMP command that is handled by qmp_migrate_set_capabilities().
> >>
> >> The migration_cleanup() and qmp_migrate_set_capabilities()
> >> calls race with each other. When the latter is invoked first,
> >> since the migration is not running (FAILED), the migration
> >> capabilities are reset to false, so during migration_cleanup()
> >> the QEMU process crashes with an assertion failure.
> >>
> >>     Stack trace of thread 255014:
> >>     #0  __pthread_kill_implementation (libc.so.6 + 0x822e8)
> >>     #1  raise (libc.so.6 + 0x3a73c)
> >>     #2  abort (libc.so.6 + 0x27034)
> >>     #3  __assert_fail_base (libc.so.6 + 0x34090)
> >>     #4  __assert_fail (libc.so.6 + 0x34100)
> >>     #5  yank_unregister_instance (qemu-kvm + 0x8b8280)
> >>     #6  migrate_fd_cleanup (qemu-kvm + 0x3c6308)
> >>     #7  migration_bh_dispatch_bh (qemu-kvm + 0x3c2144)
> >>     #8  aio_bh_poll (qemu-kvm + 0x8ba358)
> >>     #9  aio_dispatch (qemu-kvm + 0x8a0ab4)
> >>     #10 aio_ctx_dispatch (qemu-kvm + 0x8bb180)
> >>
> >> Introduce a new migration status FAILING and use it as an
> >> interim status when an error occurs. Once migration_cleanup()
> >> is done, it sets the migration status to FAILED. This helps
> >> to avoid the above race condition and the ensuing failure.
> >>
> >> The interim FAILING status is set wherever execution moves
> >> towards migration_cleanup(), either directly or via:
> >>
> >>     migration_iteration_finish       bg_migration_iteration_finish
> >>       -> migration_bh_schedule         -> migration_bh_schedule
> >>         -> migration_cleanup_bh          -> migration_cleanup_bh
> >>           -> migration_cleanup             -> migration_cleanup
> >>             -> FAILED                        -> FAILED
> >>
> >> The migration status finally moves to FAILED and reports an
> >> appropriate error to the user.
> >
> > I raised a request while we were discussing this internally; I don't
> > see it addressed here, so I will raise it again:
> >
> > Would you please list where you decided to switch from FAILED -> FAILING,
> > and where you decided not to, with a justification for each of them?
> >
> > Let me give a detailed example in this patch, please see below.
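
But first, as an aside, to make the reported race and the intent of the
interim state concrete, here is a standalone toy model (plain C, not QEMU
code; every name in it is invented except the status names) of why keeping
the error state "running" until cleanup finishes closes the window that
qmp_migrate_set_capabilities() currently slips through:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy status enum mirroring the relevant states. */
typedef enum { ST_ACTIVE, ST_FAILING, ST_FAILED } ToyStatus;

static ToyStatus status = ST_ACTIVE;
static bool caps_set = true;     /* stands in for the migration capabilities */

static bool toy_is_running(ToyStatus s)
{
    /* With the patch, FAILING is still considered "running". */
    return s == ST_ACTIVE || s == ST_FAILING;
}

/* Models qmp_migrate_set_capabilities(): only allowed to touch the
 * capabilities when no migration is in flight. */
static void toy_set_capabilities(bool value)
{
    if (toy_is_running(status)) {
        printf("capabilities change rejected: migration still running\n");
        return;
    }
    caps_set = value;
}

/* Models migration_cleanup(): relies on the capabilities still being the
 * ones the migration was started with, then finalizes the state. */
static void toy_cleanup(void)
{
    assert(caps_set);            /* stands in for the assertion that fired */
    status = ST_FAILED;
}

int main(void)
{
    /* The connection breaks: with the patch we enter the interim state. */
    status = ST_FAILING;

    /* libvirt races in and tries to clear the capabilities; if the state
     * were already FAILED here, this call would succeed and toy_cleanup()
     * below would assert, which is essentially the reported crash. */
    toy_set_capabilities(false);

    toy_cleanup();
    printf("final state: %s\n", status == ST_FAILED ? "FAILED" : "?");
    return 0;
}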
> >
> >>
> >> Signed-off-by: Prasad Pandit <[email protected]>
> >> ---
> >>  migration/migration.c                 | 33 +++++++++++++++------------
> >>  migration/multifd.c                   |  4 ++--
> >>  qapi/migration.json                   |  8 ++++---
> >>  tests/qtest/migration/migration-qmp.c |  3 ++-
> >>  tests/qtest/migration/precopy-tests.c |  5 ++--
> >>  5 files changed, 31 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index b316ee01ab..5c32bc8fe5 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -1216,6 +1216,7 @@ bool migration_is_running(void)
> >>      case MIGRATION_STATUS_DEVICE:
> >>      case MIGRATION_STATUS_WAIT_UNPLUG:
> >>      case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_COLO:
> >>          return true;
> >>      default:
> >> @@ -1351,6 +1352,7 @@ static void fill_source_migration_info(MigrationInfo *info)
> >>          break;
> >>      case MIGRATION_STATUS_ACTIVE:
> >>      case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_POSTCOPY_DEVICE:
> >>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >>      case MIGRATION_STATUS_PRE_SWITCHOVER:
> >> @@ -1409,6 +1411,7 @@ static void fill_destination_migration_info(MigrationInfo *info)
> >>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >>      case MIGRATION_STATUS_POSTCOPY_PAUSED:
> >>      case MIGRATION_STATUS_POSTCOPY_RECOVER:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_FAILED:
> >>      case MIGRATION_STATUS_COLO:
> >>          info->has_status = true;
> >> @@ -1531,6 +1534,9 @@ static void migration_cleanup(MigrationState *s)
> >>      if (s->state == MIGRATION_STATUS_CANCELLING) {
> >>          migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> >>                            MIGRATION_STATUS_CANCELLED);
> >> +    } else if (s->state == MIGRATION_STATUS_FAILING) {
> >> +        migrate_set_state(&s->state, MIGRATION_STATUS_FAILING,
> >> +                          MIGRATION_STATUS_FAILED);
> >>      }
> >>
> >>      if (s->error) {
> >> @@ -1584,7 +1590,7 @@ static void migration_connect_set_error(MigrationState *s, const Error *error)
> >>
> >>      switch (current) {
> >>      case MIGRATION_STATUS_SETUP:
> >> -        next = MIGRATION_STATUS_FAILED;
> >> +        next = MIGRATION_STATUS_FAILING;
> >
> > This is the first real change: we switch to FAILING when
> > migration_connect_set_error() is invoked and the migration has failed.
> >
> > Please justify why setting FAILING is correct here.
> >
> > This function is invoked in three callers:
> >
> >   qmp_migrate[2302]          migration_connect_set_error(s, local_err);
> >   qmp_migrate_finish[2347]   migration_connect_set_error(s, local_err);
> >   migration_connect[4047]    migration_connect_set_error(s, error_in);
> >
> > At least in the initial two callers, I don't see migration_cleanup()
> > invoked after setting FAILING. Could this cause migration to get stuck
> > in the FAILING status forever, without finally moving to FAILED?
> 
> Good point, I'm working on some cleanups to the connection code, and one
> change I did there is to add a migration_cleanup() call into
> migration_connect_error_propagate().

Yeah, this sounds like a good thing in general.
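
The property we need to keep is that any path which flips the state to
FAILING also guarantees a cleanup that finalizes it, so that nothing can
observe FAILING forever. As a standalone toy model of that pairing (plain
C, not QEMU code, all names invented), what I'd expect the propagate
helper to guarantee is roughly:

#include <assert.h>
#include <stdio.h>

typedef enum { ST_SETUP, ST_FAILING, ST_FAILED, ST_POSTCOPY_PAUSED } ToyStatus;

typedef struct {
    ToyStatus state;
    const char *error;      /* stands in for s->error */
    int postcopy_resume;    /* waiting for a new channel from the user */
} ToyMigration;

/* Models migration_cleanup(): always finalizes the interim state. */
static void toy_cleanup(ToyMigration *s)
{
    if (s->state == ST_FAILING) {
        s->state = ST_FAILED;
    }
}

/* Models the error-propagate helper with the cleanup folded in: record the
 * error, pick the next state, and immediately run cleanup so the caller
 * never has to remember to do it. */
static void toy_error_propagate(ToyMigration *s, const char *err)
{
    s->error = err;
    s->state = s->postcopy_resume ? ST_POSTCOPY_PAUSED : ST_FAILING;
    toy_cleanup(s);

    /* The invariant every caller can then rely on: */
    assert(s->state == ST_FAILED || s->state == ST_POSTCOPY_PAUSED);
}

int main(void)
{
    ToyMigration s = { .state = ST_SETUP };
    toy_error_propagate(&s, "connection refused");
    printf("state after error: %s\n",
           s.state == ST_FAILED ? "FAILED" : "POSTCOPY_PAUSED");
    return 0;
}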
A few quick comments below.

>
> 1) branch WIP:
> https://gitlab.com/farosas/qemu/-/commits/migration-connect-cleanup
>
> 2) the patch:
> ---
> From 1f9eeb898f3a5efba7c183e351fa36a5471fd0b2 Mon Sep 17 00:00:00 2001
> From: Fabiano Rosas <[email protected]>
> Date: Thu, 18 Dec 2025 10:54:46 -0300
> Subject: [PATCH] migration: Fold migration_cleanup into
>  migration_connect_error_propagate
>
> The code path from qmp_migrate() until the migration thread starts is
> a little twisty due to the async nature of some routines. One issue is
> that the async functions cannot return errors to their callers and
> must instead call forward into migration_channel_connect() and pass
> the error as input.
>
> Ideally we'd have a function that just receives the error as input and
> handles it. However, currently migration_channel_connect() has a code
> path that moves forward into migration_connect(), also passing the
> error as input, only for migration_connect() to then check that an
> error happened and call migration_cleanup().
>
> Clean this up:
>
> 1) Make migration_connect_error_propagate() be the function that
>    handles the error and call it at the point the error happens in the
>    async code. (this is all migration code, there's no layering
>    violation)
>
> 2) Stop checking for an incoming error in migration_connect(); that
>    function should only be reached when everything that came before
>    succeeded.
>
> 3) Fold migration_cleanup() into migration_connect_error_propagate()
>    so the cleanup happens at the moment the error is detected and not
>    several calls down the stack.
>
> 4) To address the quirk that during postcopy recovery there should be
>    no cleanup, move that check into migration_cleanup() and return
>    early if doing resume.
>
> Notable functional changes:
>
> i) Assumes a larger window for what it means to be "in resume":
>    before: from qmp_migrate until migration_connect
>    after: from qmp_migrate until the state transition into
>    MIGRATION_STATUS_POSTCOPY_RECOVER
>
> ii) After an error, MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP changes to
>     MIGRATION_STATUS_POSTCOPY_PAUSED.
>
>     This is already the case when migration_connect_error_propagate()
>     was used, but not when migration_connect() receives error_in.
>     Seems like a pre-existing bug, actually.
>
> iii) If the socket_start_outgoing_migration function *returns* an
>      error, now migration_cleanup() is called. Previously, cleanup
>      only happened when the error was *passed* forward, i.e. only
>      after the async call.
>
> iv) If cpr_state_save() fails, now migration_cleanup() is called.
>
> Signed-off-by: Fabiano Rosas <[email protected]>
> ---
>  migration/migration.c | 35 ++++++++++++++++++-----------------
>  1 file changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 4b1afcab24..52c1a97e46 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1481,6 +1481,14 @@ static void migration_cleanup(MigrationState *s)
>      MigrationEventType type;
>      QEMUFile *tmp = NULL;
>
> +    /*
> +     * Don't do cleanup if we're waiting for another connection from
> +     * the user.
> +     */
> +    if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP) {
> +        return;
> +    }

I wonder if QEMU will still be in RECOVER_SETUP when reaching here..
shouldn't [2] below have already moved it to PAUSED for a postcopy?

The other thing is, migration_cleanup() currently also takes the job of
dumping the error to stderr:

    if (s->error) {
        /* It is used on info migrate.  We can't free it */
        error_report_err(error_copy(s->error));
    }

I wonder if we could merge [1] below, then here, when it's the paused
case, we "goto" that part (we may need to move the error_report_err to
the end of migration_cleanup(), though, please double check).
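
To be concrete about the shape I mean, here is a small standalone sketch
(plain C, not a patch against the real migration_cleanup(), untested, and
with invented names) of the "jump to the error-report tail" structure:

#include <stdio.h>

typedef enum { ST_FAILING, ST_FAILED, ST_POSTCOPY_RECOVER_SETUP } ToyStatus;

typedef struct {
    ToyStatus state;
    const char *error;
} ToyMigration;

/* Toy cleanup: the recover-setup early exit skips the heavy teardown but
 * still falls through to the error report at the tail, so the report does
 * not need to be duplicated at the call sites. */
static void toy_cleanup(ToyMigration *s)
{
    if (s->state == ST_POSTCOPY_RECOVER_SETUP) {
        /* Waiting for another channel from the user: no real cleanup. */
        goto out_report;
    }

    /* ... the usual teardown would go here ... */
    if (s->state == ST_FAILING) {
        s->state = ST_FAILED;
    }

out_report:
    if (s->error) {
        /* stands in for error_report_err(error_copy(s->error)) */
        fprintf(stderr, "migration error: %s\n", s->error);
    }
}

int main(void)
{
    ToyMigration paused = { ST_POSTCOPY_RECOVER_SETUP, "channel broken" };
    ToyMigration failed = { ST_FAILING, "connection reset" };

    toy_cleanup(&paused);   /* only reports, keeps waiting for the user */
    toy_cleanup(&failed);   /* full cleanup, then reports */
    printf("failed.state == ST_FAILED: %d\n", failed.state == ST_FAILED);
    return 0;
}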
> +
>      trace_migration_cleanup();
>
>      migration_cleanup_json_writer(s);
> @@ -1585,6 +1593,14 @@ static void migration_connect_error_propagate(MigrationState *s, Error *error)
>      case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
>          /* Never fail a postcopy migration; switch back to PAUSED instead */
>          next = MIGRATION_STATUS_POSTCOPY_PAUSED;
> +
> +        /*
> +         * Give HMP user a hint on what failed. It's normally done in
> +         * migration_cleanup(), but call it here explicitly because we
> +         * don't do cleanup when waiting for postcopy recover.
> +         */
> +        error_report_err(error_copy(error));

[1]

> +
>          break;
>      default:
>          /*
> @@ -1598,6 +1614,7 @@ static void migration_connect_error_propagate(MigrationState *s, Error *error)
>
>      migrate_set_state(&s->state, current, next);

[2]

>      migrate_error_propagate(s, error);
> +    migration_cleanup(s);
>  }
>
>  void migration_cancel(void)
> @@ -2326,12 +2343,8 @@ static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>      }
>
>      if (local_err) {
> -        if (!resume_requested) {
> -            yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> -        }
> -        migration_connect_error_propagate(s, error_copy(local_err));
> +        migration_connect_error_propagate(s, local_err);

I feel like this is a rebase accident.. now that this function is renamed
to migration_connect_error_propagate(), it takes over ownership of the
error, so IIUC error_copy() will still be needed here (as the
error_propagate() right below will also take ownership).

>          error_propagate(errp, local_err);
> -        return;
>      }
>  }
>
> @@ -4026,18 +4039,6 @@ void migration_connect(MigrationState *s, Error *error_in)
>      s->expected_downtime = migrate_downtime_limit();
>      if (error_in) {
>          migration_connect_error_propagate(s, error_in);
> -        if (resume) {
> -            /*
> -             * Don't do cleanup for resume if channel is invalid, but only dump
> -             * the error. We wait for another channel connect from the user.
> -             * The error_report still gives HMP user a hint on what failed.
> -             * It's normally done in migration_cleanup(), but call it here
> -             * explicitly.
> -             */
> -            error_report_err(error_copy(s->error));
> -        } else {
> -            migration_cleanup(s);
> -        }

Yeah, this part looks better.

>          return;
>      }
>
> --

--
Peter Xu
