On Tue, Dec 23, 2025 at 12:36:19PM -0300, Fabiano Rosas wrote:
> Peter Xu <[email protected]> writes:
> 
> > On Mon, Dec 22, 2025 at 05:18:22PM +0530, Prasad Pandit wrote:
> >> From: Prasad Pandit <[email protected]>
> >> 
> >> When migration connection is broken, the QEMU and libvirtd(8)
> >> process on the source side receive TCP connection reset
> >> notification. QEMU sets the migration status to FAILED and
> >> proceeds to migration_cleanup(). Meanwhile, Libvirtd(8) sends
> >> a QMP command to migrate_set_capabilities().
> >> 
> >> The migration_cleanup() and qmp_migrate_set_capabilities()
> >> calls race with each other. When the latter is invoked first,
> >> since the migration is not running (FAILED), migration
> >> capabilities are reset to false, so during migration_cleanup()
> >> the QEMU process crashes with assertion failure.
> >> 
> >>   Stack trace of thread 255014:
> >>    #0  __pthread_kill_implementation (libc.so.6 + 0x822e8)
> >>    #1  raise (libc.so.6 + 0x3a73c)
> >>    #2  abort (libc.so.6 + 0x27034)
> >>    #3  __assert_fail_base (libc.so.6 + 0x34090)
> >>    #4  __assert_fail (libc.so.6 + 0x34100)
> >>    #5  yank_unregister_instance (qemu-kvm + 0x8b8280)
> >>    #6  migrate_fd_cleanup (qemu-kvm + 0x3c6308)
> >>    #7  migration_bh_dispatch_bh (qemu-kvm + 0x3c2144)
> >>    #8  aio_bh_poll (qemu-kvm + 0x8ba358)
> >>    #9  aio_dispatch (qemu-kvm + 0x8a0ab4)
> >>    #10 aio_ctx_dispatch (qemu-kvm + 0x8bb180)
> >> 
> >> Introduce a new migration status FAILING and use it as an
> >> interim status when an error occurs. Once migration_cleanup()
> >> is done, it sets the migration status to FAILED. This helps
> >> to avoid the above race condition and ensuing failure.
> >> 
> >> Interim status FAILING is set wherever the execution moves
> >> towards migration_cleanup() directly OR via:
> >> 
> >>   migration_iteration_finish  bg_migration_iteration_finish
> >>   -> migration_bh_schedule    -> migration_bh_schedule
> >>    -> migration_cleanup_bh     -> migration_cleanup_bh
> >>     -> migration_cleanup        -> migration_cleanup
> >>      -> FAILED                   -> FAILED
> >> 
> >> The migration status finally moves to FAILED and reports an
> >> appropriate error to the user.
> >
> > I raised a request while I was discussing with you internally, I didn't see
> > this, I will request again:
> >
> > Would you please list where you decided to switch from FAILED -> FAILING,
> > and where you decided not, with justifications for each of them?
> >
> > Let me give a detailed example in this patch, please see below.
> >
> >> 
> >> Signed-off-by: Prasad Pandit <[email protected]>
> >> ---
> >>  migration/migration.c                 | 33 +++++++++++++++------------
> >>  migration/multifd.c                   |  4 ++--
> >>  qapi/migration.json                   |  8 ++++---
> >>  tests/qtest/migration/migration-qmp.c |  3 ++-
> >>  tests/qtest/migration/precopy-tests.c |  5 ++--
> >>  5 files changed, 31 insertions(+), 22 deletions(-)
> >> 
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index b316ee01ab..5c32bc8fe5 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -1216,6 +1216,7 @@ bool migration_is_running(void)
> >>      case MIGRATION_STATUS_DEVICE:
> >>      case MIGRATION_STATUS_WAIT_UNPLUG:
> >>      case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_COLO:
> >>          return true;
> >>      default:
> >> @@ -1351,6 +1352,7 @@ static void fill_source_migration_info(MigrationInfo 
> >> *info)
> >>          break;
> >>      case MIGRATION_STATUS_ACTIVE:
> >>      case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_POSTCOPY_DEVICE:
> >>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >>      case MIGRATION_STATUS_PRE_SWITCHOVER:
> >> @@ -1409,6 +1411,7 @@ static void 
> >> fill_destination_migration_info(MigrationInfo *info)
> >>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >>      case MIGRATION_STATUS_POSTCOPY_PAUSED:
> >>      case MIGRATION_STATUS_POSTCOPY_RECOVER:
> >> +    case MIGRATION_STATUS_FAILING:
> >>      case MIGRATION_STATUS_FAILED:
> >>      case MIGRATION_STATUS_COLO:
> >>          info->has_status = true;
> >> @@ -1531,6 +1534,9 @@ static void migration_cleanup(MigrationState *s)
> >>      if (s->state == MIGRATION_STATUS_CANCELLING) {
> >>          migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> >>                            MIGRATION_STATUS_CANCELLED);
> >> +    } else if (s->state == MIGRATION_STATUS_FAILING) {
> >> +        migrate_set_state(&s->state, MIGRATION_STATUS_FAILING,
> >> +                          MIGRATION_STATUS_FAILED);
> >>      }
> >>  
> >>      if (s->error) {
> >> @@ -1584,7 +1590,7 @@ static void 
> >> migration_connect_set_error(MigrationState *s, const Error *error)
> >>  
> >>      switch (current) {
> >>      case MIGRATION_STATUS_SETUP:
> >> -        next = MIGRATION_STATUS_FAILED;
> >> +        next = MIGRATION_STATUS_FAILING;
> >
> > This is the first real change that we'll switch to FAILING when
> > migration_connect_set_error() is invoked and migration failed.
> >
> > Please justify why setting FAILING is correct here.
> >
> > This function is invoked in three callers:
> >
> > qmp_migrate[2302]              migration_connect_set_error(s, local_err);
> > qmp_migrate_finish[2347]       migration_connect_set_error(s, local_err);
> > migration_connect[4047]        migration_connect_set_error(s, error_in);
> >
> > At least from the initial two callers, I don't see migration_cleanup()
> > invoked after setting FAILING.  Could this cause migration to get into
> > FAILING status forever without finally move to FAILED?
> >
> 
> Good point, I'm working on some cleanups to connection code and one
> change I did there is to add a migration_cleanup() call into
> migration_connect_error_propagate().

Yeh, this sounds like a good thing in general.  A few quick comments
below.

> 
> 1) branch WIP:
> https://gitlab.com/farosas/qemu/-/commits/migration-connect-cleanup
> 
> 2) the patch:
> ---
> From 1f9eeb898f3a5efba7c183e351fa36a5471fd0b2 Mon Sep 17 00:00:00 2001
> From: Fabiano Rosas <[email protected]>
> Date: Thu, 18 Dec 2025 10:54:46 -0300
> Subject: [PATCH] migration: Fold migration_cleanup into
>  migration_connect_error_propagate
> 
> The code path from qmp_migrate() until the migration thread starts is
> a little twisty due to the async nature of some routines. One issue is
> that the async functions cannot return errors to their callers and
> must instead call forward into migration_channel_connect() and pass
> the error as input.
> 
> Ideally we'd have a function that just receives the error as input and
> handles it. However, currently migration_channel_connect() has a code
> path that moves forward into migration_connect(), also passing the
> error as input, only for migration_connect() to then check that an
> error happened and call migration_cleanup().
> 
> Clean this up:
> 
> 1) Make migration_connect_error_propagate() be the function that
> handles the error and call it at the point the error happens in the
> async code. (this is all migration code, there's no layering
> violation)
> 
> 2) Stop checking for an incoming error in migration_connect(), that
> function should be only reached when everything that came before
> succeeded.
> 
> 3) Fold migration_cleanup() into migration_connect_error_propagate()
> so the cleanup happens at the moment the error is detected and not
> several calls down the stack.
> 
> 4) To address the quirk that during postcopy recovery there should be
> no cleanup, move that check into migration_cleanup() and return early
> if doing resume.
> 
> Notable functional changes:
> 
> i) Assumes a larger window for what it means to be "in resume"
>    before: from qmp_migrate until migration_connect
>    after: from qmp_migration until the state transition into
>           MIGRATION_STATUS_POSTCOPY_RECOVER
> 
> ii) After an error, MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP changes to
>     MIGRATION_STATUS_POSTCOPY_PAUSED.
> 
>     This is already the case when migration_connect_error_propagate()
>     was used, but not when migration_connect() receives
>     error_in. Seems like a pre-existing bug actually.
> 
> iii) If the socket_start_outgoing_migration function *returns* an
>      error, now migration_cleanup() is called. Previously, cleanup
>      only happened when the error was *passed* forward, i.e. only
>      after the async call.
> 
> iv) If cpr_state_save() fails, now migration_cleanup() is called.
> 
> Signed-off-by: Fabiano Rosas <[email protected]>
> ---
>  migration/migration.c | 35 ++++++++++++++++++-----------------
>  1 file changed, 18 insertions(+), 17 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 4b1afcab24..52c1a97e46 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1481,6 +1481,14 @@ static void migration_cleanup(MigrationState *s)
>      MigrationEventType type;
>      QEMUFile *tmp = NULL;
>  
> +    /*
> +     * Don't do cleanup if we're waiting for another connection from
> +     * the user.
> +     */
> +    if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP) {
> +        return;
> +    }

I wonder if QEMU will be in RECOVER_SETUP when reaching here..  shouldn't
below [2] already moved it to PAUSED for a postcopy?

The other thing is, migration_cleanup() currently also takes the job of
dumping the error to stderr:

    if (s->error) {
        /* It is used on info migrate.  We can't free it */
        error_report_err(error_copy(s->error));
    }

I wonder if we could merge [1] below, then here when it's PAUSED we "goto"
that part (we may need to move the error_report_err to the end of
migration_cleanup, though, please double check).

> +
>      trace_migration_cleanup();
>  
>      migration_cleanup_json_writer(s);
> @@ -1585,6 +1593,14 @@ static void 
> migration_connect_error_propagate(MigrationState *s, Error *error)
>      case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
>          /* Never fail a postcopy migration; switch back to PAUSED instead */
>          next = MIGRATION_STATUS_POSTCOPY_PAUSED;
> +
> +        /*
> +         * Give HMP user a hint on what failed.  It's normally done in
> +         * migration_cleanup(), but call it here explicitly because we
> +         * don't do cleanup when waiting for postcopy recover.
> +         */
> +        error_report_err(error_copy(error));

[1]

> +
>          break;
>      default:
>          /*
> @@ -1598,6 +1614,7 @@ static void 
> migration_connect_error_propagate(MigrationState *s, Error *error)
>  
>      migrate_set_state(&s->state, current, next);

[2]

>      migrate_error_propagate(s, error);
> +    migration_cleanup(s);
>  }
>  
>  void migration_cancel(void)
> @@ -2326,12 +2343,8 @@ static void qmp_migrate_finish(MigrationAddress *addr, 
> bool resume_requested,
>      }
>  
>      if (local_err) {
> -        if (!resume_requested) {
> -            yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> -        }
> -        migration_connect_error_propagate(s, error_copy(local_err));
> +        migration_connect_error_propagate(s, local_err);

I feel like this is a rebase accident.. now after renaming this function to
migration_connect_error_propagate() it'll take over the ownership, so IIUC
error_copy() will be needed here (as error_propagate() right below will
also take the ownership..).

>          error_propagate(errp, local_err);
> -        return;
>      }
>  }
>  
> @@ -4026,18 +4039,6 @@ void migration_connect(MigrationState *s, Error 
> *error_in)
>      s->expected_downtime = migrate_downtime_limit();
>      if (error_in) {
>          migration_connect_error_propagate(s, error_in);
> -        if (resume) {
> -            /*
> -             * Don't do cleanup for resume if channel is invalid, but only 
> dump
> -             * the error.  We wait for another channel connect from the user.
> -             * The error_report still gives HMP user a hint on what failed.
> -             * It's normally done in migration_cleanup(), but call it here
> -             * explicitly.
> -             */
> -            error_report_err(error_copy(s->error));
> -        } else {
> -            migration_cleanup(s);
> -        }

Yeah this part looks better.

>          return;
>      }
>  
> -- 
> 

-- 
Peter Xu


Reply via email to