Peter Xu <[email protected]> writes:
> On Mon, Dec 22, 2025 at 11:29:57AM -0300, Fabiano Rosas wrote:
>> I'm fine with the general idea:
>>
>> i) FAILED and CANCELLED are terminal states. It makes sense to not have
>> work happen after they're set.
>>
>> ii) Using an intermediate state, assuming locking/atomic are correct is
>> a suitable fix for the issue.
>>
>> iii) Using a FAILING status seems appropriate.
>>
>> However,
>>
>> It would be great if we could stop exposing implementation details via
>> QAPI. Does the user really need to see events for CANCELLING and
>> FAILING?
>>
>> It would probably be easier if we kept MigrationStatus as QAPI only and
>> used a separate mechanism to track the internal states.
>>
>> That said, we could merge this as is to fix the bug and think about that
>> later.
>
> This bug looks to be there for a long time, IMHO we don't need to rush
> fixing it if we risk adding a new status and revert it quickly... Let's
> discuss it here, and it's a valid question indeed.
>
> One thing good about exposing such status via QAPI is, it can help us
> diagnose issues by seeing CANCELLING / FAILING even looking at
> query-migrate results (as normally when bug happens we can't see the
> internal status..), so that we know either it's explicitly cancelled, or
> something went wrong.
>
> If it's a completely hidden / internal status, we may see ACTIVE even if
> something wrong happened..
>
On the other hand, we could have more fine-grained statuses and track
them with tracepoints.
> My current hope is any mgmt should normally by default ignore new migration
> states.. If that's always achieved, it looks to me adding FAILING directly
> into migration status would still have some benefits on debugging.
>
Maybe for FAILING it's ok, as we already have CANCELLING and FAILED is
currently mismatched.
> [...]
>
>> > @@ -2907,7 +2914,7 @@ fail_closefb:
>> > qemu_fclose(fb);
>> > fail:
>> > if (ms->state != MIGRATION_STATUS_CANCELLING) {
>> > - migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>> > + migrate_set_state(&ms->state, ms->state,
>> > MIGRATION_STATUS_FAILING);
>> > }
>>
>> This is a good example where having MigrationStatus makes the code more
>> complicated. This could be exiting=true, running=false, etc. It
>> shouldn't matter why this routine failed. If we reach
>> migration_cleanup() and, at the very end, state is CANCELLING, then we
>> know the cancel command has caused this and set the state to
>> CANCELLED. If the state was something else, then it's unintended and we
>> set FAILED.
>
> If it'll be an internal status, we'll still need to identify if someone
> already have cancelled it, right?
>
I was thinking of keeping CANCELLING, since it's already in the API.
> Assuming we introduce stop_reason flag, making it:
>
> enum {
> MIG_STOP_REASON_CANCEL,
> MIG_STOP_REASON_FAIL,
> } MigrationStopReason;
>
> Then we can switch to CANCELLED / FAILED when cleanup from those reasons.
>
> Then here, logically we also need logic like:
>
> if (stop_reason != MIG_STOP_REASON_CANCEL) {
> stop_reason = MIG_STOP_REASON_FAIL;
> }
>
> Because we want to make sure when the user already triggered cancel, it
> won't show FAILED but only show CANCELLED at last?