Hello Fabiano,

* Thanks so much for the quick comments.

On Mon, 22 Dec 2025 at 20:00, Fabiano Rosas <[email protected]> wrote:
> This doesn't look like it's caused by set-capabilities. Seems like
> there might be some other bug lurking around, such as a locking issue
> with qemu_file_lock or even the BQL. Please clarify.
>
> Also, if possible provide an upstream backtrace, or at least mention the
> code path based on upstream code. It's ok to keep the downstream
> backtrace, but point that out in the commit message.

* Right, migrate_fd_cleanup() was renamed to migration_cleanup() in
  -> https://gitlab.com/qemu-project/qemu/-/commit/4bbadfc55e6ec608df75911b4360e6e995daa28c
===
libvirtd(8) 19:03:07.260+0000: 396587: error :
qemuMigrationJobCheckStatus:2056 : operation failed: job 'migration
out' failed: Unable to write to socket: Connection reset by peer
libvirtd(8) 19:03:07.261+0000: 396587: info : qemuMonitorSend:836 :
QEMU_MONITOR_SEND_MSG: mon=0xffffa000e010
msg={"execute":"migrate-set-capabilities","arguments":


qemu-system-aarch64:initiating migration
qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 1: 0xaaaaccce0110
qemu-system-aarch64: migrate_fd_cleanup: before multifd_send_shutdown:
0   <== multifd disabled
qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 2: (nil)
qemu-system-aarch64: Unable to write to socket: Connection reset by peer
qemu-system-aarch64: ../util/yank.c:107: yank_unregister_instance:
Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
qemu-system-aarch64:shutting down, reason=crashed
===
* As seen above, when the connection is reset:
     1) libvirtd(8) sends the migrate-set-capabilities command to QEMU
to reset the migration options to false(0). This resets
s->capabilities[MIGRATION_CAPABILITY_MULTIFD] to false.
     2) When migration_cleanup() (earlier migrate_fd_cleanup()) is
called after the above reset:
          migration_cleanup
           -> multifd_send_shutdown
            -> if (!migrate_multifd()) {
                   return;   <== returns because _CAPABILITY_MULTIFD
                                 is now reset to false(0)
               }
         Because _CAPABILITY_MULTIFD has been reset to false,
multifd_send_shutdown() returns without actually doing its multifd
cleanup, i.e. the multifd objects stay alive and are never freed; see
the sketch below.
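* For context, migrate_multifd() is essentially a read of that
capability flag. Here is a minimal standalone sketch of the dependency
(not the actual QEMU code; the real definitions live in
migration/options.c, migration/multifd.c and the QAPI schema, and are
only approximated here):

    #include <stdbool.h>

    /* stand-ins for MigrationState and its capability array */
    enum { MIGRATION_CAPABILITY_MULTIFD, MIGRATION_CAPABILITY__MAX };

    typedef struct MigrationState {
        bool capabilities[MIGRATION_CAPABILITY__MAX];
    } MigrationState;

    static MigrationState current_migration;

    static MigrationState *migrate_get_current(void)
    {
        return &current_migration;
    }

    static bool migrate_multifd(void)
    {
        MigrationState *s = migrate_get_current();

        /* cleared by the late migrate-set-capabilities command */
        return s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
    }

    static void multifd_send_shutdown(void)
    {
        if (!migrate_multifd()) {
            return;   /* bail out: channels never joined/unregistered */
        }
        /* real code joins send channels and unregisters yank functions */
    }

    int main(void)
    {
        migrate_get_current()->capabilities[MIGRATION_CAPABILITY_MULTIFD] = false;
        multifd_send_shutdown();   /* silently returns, no cleanup done */
        return 0;
    }

  Once libvirtd(8) clears the capability, the shutdown path becomes a
no-op even though multifd channels were created and used earlier.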

* And that leads to the said assert(3) failure in
     migration_cleanup()
     {
         ...
         multifd_send_shutdown();   <== does not clean up
         ...
         yank_unregister_instance(MIGRATION_YANK_INSTANCE);   <== assert(3) failure
     }
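* The resulting abort can be reduced to a few lines. Below is a
hypothetical standalone repro of the ordering problem (plain C, no QEMU
internals; the yankfns counter stands in for the entry->yankfns list
that util/yank.c checks with QLIST_EMPTY()):

    #include <assert.h>
    #include <stdbool.h>

    static int yankfns;            /* registered yank functions */
    static bool multifd_enabled;   /* _CAPABILITY_MULTIFD stand-in */

    static void multifd_send_shutdown(void)
    {
        if (!multifd_enabled) {
            return;                /* the early return shown above */
        }
        yankfns = 0;               /* real code unregisters one fn per channel */
    }

    static void yank_unregister_instance(void)
    {
        assert(yankfns == 0);      /* mirrors util/yank.c:107 */
    }

    static void migration_cleanup(void)
    {
        multifd_send_shutdown();
        yank_unregister_instance();
    }

    int main(void)
    {
        multifd_enabled = true;
        yankfns = 2;               /* multifd channels registered yank fns */

        multifd_enabled = false;   /* migrate-set-capabilities lands here */
        migration_cleanup();       /* assert fires, as in the backtrace */
        return 0;
    }

  Drop the capability reset in main() and the program exits cleanly;
with it, the assertion fires just like the observed crash.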

> I'm fine with the general idea:
>
> i) FAILED and CANCELLED are terminal states. It makes sense to not have
> work happen after they're set.
>
> ii) Using an intermediate state, assuming locking/atomic are correct is
> a suitable fix for the issue.
>
> iii) Using a FAILING status seems appropriate.
>
> However,
>
> It would be great if we could stop exposing implementation details via
> QAPI. Does the user really need to see events for CANCELLING and
> FAILING?

* Not really; libvirtd(8)/the user only needs to know about success or
failure, along with the appropriate reason.

> This is a good example where having MigrationStatus makes the code more
> complicated. This could be exiting=true, running=false, etc. It
> shouldn't matter why this routine failed.

* True, it is rather complicated. Although we currently have 17-18
migration states defined, going by the functions
migration_is_running(), migration_is_active(), and
migration_has_failed(), the migration process really has only 3-4
states:
      0 -> stopped (non-active, not started yet)
      1 -> running/active (sending/receiving migration data)
      2 -> paused/active (not running, not sending/receiving data)
      3 -> stopped/failed/completed (non-active, not running)
Other interim states, such as initiating/setting up connections or
cancelling/failing, could be tracked differently; see the illustration
below.
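* As a rough illustration (purely hypothetical, not a proposed patch),
those 3-4 states plus your exiting/running booleans could collapse into
something like:

    #include <stdbool.h>

    /* hypothetical simplification; the real MigrationStatus enum in
     * the QAPI schema has many more values */
    typedef enum {
        MIG_STOPPED,    /* non-active, not started yet */
        MIG_RUNNING,    /* active, sending/receiving migration data */
        MIG_PAUSED,     /* active, but not moving data */
        MIG_FINISHED,   /* non-active: completed, failed or cancelled */
    } MigCoreState;

    typedef struct MigProgress {
        MigCoreState state;
        bool exiting;   /* cancelling/failing in progress, internal only */
        bool failed;    /* final outcome reported to the user */
    } MigProgress;

    static inline bool migration_is_running(const MigProgress *p)
    {
        return p->state == MIG_RUNNING && !p->exiting;
    }

  The interim CANCELLING/FAILING book-keeping would then never need to
appear in QAPI events at all.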


Thank you.
---
  - Prasad

