On Mon, Dec 22, 2025 at 11:29:57AM -0300, Fabiano Rosas wrote:
> I'm fine with the general idea:
>
> i) FAILED and CANCELLED are terminal states. It makes sense to not have
> work happen after they're set.
>
> ii) Using an intermediate state, assuming locking/atomic are correct is
> a suitable fix for the issue.
>
> iii) Using a FAILING status seems appropriate.
>
> However,
>
> It would be great if we could stop exposing implementation details via
> QAPI. Does the user really need to see events for CANCELLING and
> FAILING?
>
> It would probably be easier if we kept MigrationStatus as QAPI only and
> used a separate mechanism to track the internal states.
>
> That said, we could merge this as is to fix the bug and think about that
> later.
This bug looks to be there for a long time, IMHO we don't need to rush
fixing it if we risk adding a new status and revert it quickly... Let's
discuss it here, and it's a valid question indeed.
One thing good about exposing such status via QAPI is, it can help us
diagnose issues by seeing CANCELLING / FAILING even looking at
query-migrate results (as normally when bug happens we can't see the
internal status..), so that we know either it's explicitly cancelled, or
something went wrong.
If it's a completely hidden / internal status, we may see ACTIVE even if
something wrong happened..
My current hope is any mgmt should normally by default ignore new migration
states.. If that's always achieved, it looks to me adding FAILING directly
into migration status would still have some benefits on debugging.
[...]
> > @@ -2907,7 +2914,7 @@ fail_closefb:
> > qemu_fclose(fb);
> > fail:
> > if (ms->state != MIGRATION_STATUS_CANCELLING) {
> > - migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> > + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILING);
> > }
>
> This is a good example where having MigrationStatus makes the code more
> complicated. This could be exiting=true, running=false, etc. It
> shouldn't matter why this routine failed. If we reach
> migration_cleanup() and, at the very end, state is CANCELLING, then we
> know the cancel command has caused this and set the state to
> CANCELLED. If the state was something else, then it's unintended and we
> set FAILED.
If it'll be an internal status, we'll still need to identify if someone
already have cancelled it, right?
Assuming we introduce stop_reason flag, making it:
enum {
MIG_STOP_REASON_CANCEL,
MIG_STOP_REASON_FAIL,
} MigrationStopReason;
Then we can switch to CANCELLED / FAILED when cleanup from those reasons.
Then here, logically we also need logic like:
if (stop_reason != MIG_STOP_REASON_CANCEL) {
stop_reason = MIG_STOP_REASON_FAIL;
}
Because we want to make sure when the user already triggered cancel, it
won't show FAILED but only show CANCELLED at last?
--
Peter Xu