Re: [PATCH] migration: introduce MIGRATION_STATUS_FAILING

Fabiano Rosas Tue, 23 Dec 2025 08:02:38 -0800

Peter Xu <[email protected]> writes:

> On Mon, Dec 22, 2025 at 11:29:57AM -0300, Fabiano Rosas wrote:
>> I'm fine with the general idea:
>> 
>> i) FAILED and CANCELLED are terminal states. It makes sense to not have
>> work happen after they're set.
>> 
>> ii) Using an intermediate state, assuming locking/atomic are correct is
>> a suitable fix for the issue.
>> 
>> iii) Using a FAILING status seems appropriate.
>> 
>> However,
>> 
>> It would be great if we could stop exposing implementation details via
>> QAPI. Does the user really need to see events for CANCELLING and
>> FAILING?
>> 
>> It would probably be easier if we kept MigrationStatus as QAPI only and
>> used a separate mechanism to track the internal states.
>> 
>> That said, we could merge this as is to fix the bug and think about that
>> later.
>
> This bug looks to be there for a long time, IMHO we don't need to rush
> fixing it if we risk adding a new status and revert it quickly...  Let's
> discuss it here, and it's a valid question indeed.
>
> One thing good about exposing such status via QAPI is, it can help us
> diagnose issues by seeing CANCELLING / FAILING even looking at
> query-migrate results (as normally when bug happens we can't see the
> internal status..), so that we know either it's explicitly cancelled, or
> something went wrong.
>
> If it's a completely hidden / internal status, we may see ACTIVE even if
> something wrong happened..
>


On the other hand, we could have more fine-grained statuses and track
them with tracepoints.

> My current hope is any mgmt should normally by default ignore new migration
> states..  If that's always achieved, it looks to me adding FAILING directly
> into migration status would still have some benefits on debugging.
>

Maybe for FAILING it's ok, as we already have CANCELLING and FAILED is
currently mismatched.

> [...]
>
>> > @@ -2907,7 +2914,7 @@ fail_closefb:
>> >      qemu_fclose(fb);
>> >  fail:
>> >      if (ms->state != MIGRATION_STATUS_CANCELLING) {
>> > -        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>> > +        migrate_set_state(&ms->state, ms->state, 
>> > MIGRATION_STATUS_FAILING);
>> >      }
>> 
>> This is a good example where having MigrationStatus makes the code more
>> complicated. This could be exiting=true, running=false, etc. It
>> shouldn't matter why this routine failed. If we reach
>> migration_cleanup() and, at the very end, state is CANCELLING, then we
>> know the cancel command has caused this and set the state to
>> CANCELLED. If the state was something else, then it's unintended and we
>> set FAILED.
>
> If it'll be an internal status, we'll still need to identify if someone
> already have cancelled it, right?
>

I was thinking of keeping CANCELLING, since it's already in the API.

> Assuming we introduce stop_reason flag, making it:
>
> enum {
>     MIG_STOP_REASON_CANCEL,
>     MIG_STOP_REASON_FAIL,
> } MigrationStopReason;
>
> Then we can switch to CANCELLED / FAILED when cleanup from those reasons.
>
> Then here, logically we also need logic like:
>
>     if (stop_reason != MIG_STOP_REASON_CANCEL) {
>         stop_reason = MIG_STOP_REASON_FAIL;
>     }
>
> Because we want to make sure when the user already triggered cancel, it
> won't show FAILED but only show CANCELLED at last?

Re: [PATCH] migration: introduce MIGRATION_STATUS_FAILING

Reply via email to