Hello Fabiano,

* Thanks so much for the quick comments.
On Mon, 22 Dec 2025 at 20:00, Fabiano Rosas <[email protected]> wrote:
> This doesn't look like it's caused by set-capabilities. Seems like:
> Please clarify, there might be some other bug lurking around, such as a
> locking issue with qemu_file_lock or even the BQL.
>
> Also, if possible provide an upstream backtrace, or at least mention the
> code path based on upstream code. It's ok to keep the downstream
> backtrace, but point that out in the commit message.

* Right, migrate_fd_cleanup() was renamed to migration_cleanup() in
    -> https://gitlab.com/qemu-project/qemu/-/commit/4bbadfc55e6ec608df75911b4360e6e995daa28c

===
libvirtd(8)
19:03:07.260+0000: 396587: error : qemuMigrationJobCheckStatus:2056 : operation failed: job 'migration out' failed: Unable to write to socket: Connection reset by peer
19:03:07.261+0000: 396587: info : qemuMonitorSend:836 : QEMU_MONITOR_SEND_MSG: mon=0xffffa000e010 msg={"execute":"migrate-set-capabilities","arguments":

qemu-system-aarch64: initiating migration
qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 1: 0xaaaaccce0110
qemu-system-aarch64: migrate_fd_cleanup: before multifd_send_shutdown: 0  <== multifd disabled
qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 2: (nil)
qemu-system-aarch64: Unable to write to socket: Connection reset by peer
qemu-system-aarch64: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
qemu-system-aarch64: shutting down, reason=crashed
===

* As seen above, when the connection is reset:

  1) libvirtd(8) sends the migrate-set-capabilities command to QEMU to
     reset the migration capabilities to false(0). This resets
     s->capabilities[MIGRATION_CAPABILITY_MULTIFD] to false.

  2) When migration_cleanup() (earlier migrate_fd_cleanup()) is called
     after the above reset:

       migration_cleanup
         -> multifd_send_shutdown
              -> if (!migrate_multifd()) {
                     return;  <== returns because _CAPABILITY_MULTIFD
                 }                is reset to false(0).

     With _CAPABILITY_MULTIFD reset to false, multifd_send_shutdown()
     returns without actually doing its multifd cleanup, i.e. the multifd
     objects stay alive and are never freed.

* And that leads to the said assert(3) failure:

    migration_cleanup() {
        ...
        multifd_send_shutdown();  <== does not clean up
        ...
        yank_unregister_instance(MIGRATION_YANK_INSTANCE);  <== assert(3) failure
    }

> I'm fine with the general idea:
>
> i) FAILED and CANCELLED are terminal states. It makes sense to not have
> work happen after they're set.
>
> ii) Using an intermediate state, assuming locking/atomic are correct is
> a suitable fix for the issue.
>
> iii) Using a FAILING status seems appropriate.
>
> However,
>
> It would be great if we could stop exposing implementation details via
> QAPI. Does the user really need to see events for CANCELLING and
> FAILING?

* Not really; libvirtd(8)/the user only needs to know about success or
  failure, with the appropriate reason.

> This is a good example where having MigrationStatus makes the code more
> complicated. This could be exiting=true, running=false, etc. It
> shouldn't matter why this routine failed.

* True, it is rather complicated.
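* As an aside on the crash itself: one way to make the cleanup path
  immune to a late migrate-set-capabilities call could be to latch the
  multifd setting once when migration starts, instead of consulting the
  live capability array during cleanup. Below is a minimal, standalone
  sketch of that idea, not a patch; mig_start(), mig_cleanup() and the
  multifd_at_start field are all made-up names:

    #include <stdbool.h>
    #include <stdio.h>

    static bool cap_multifd = true;   /* stands in for s->capabilities[...] */

    typedef struct {
        bool multifd_at_start;        /* latched copy of the capability */
        bool multifd_cleaned_up;
    } MigState;

    static void mig_start(MigState *s)
    {
        s->multifd_at_start = cap_multifd;   /* latch once at setup */
        s->multifd_cleaned_up = false;
    }

    static void mig_cleanup(MigState *s)
    {
        /* Checking cap_multifd here would skip cleanup once libvirtd(8)
         * has reset the capability; the latched copy does not. */
        if (s->multifd_at_start) {
            s->multifd_cleaned_up = true;
        }
    }

    int main(void)
    {
        MigState s;
        mig_start(&s);
        cap_multifd = false;   /* libvirtd(8): migrate-set-capabilities */
        mig_cleanup(&s);
        printf("multifd cleaned up: %s\n", s.multifd_cleaned_up ? "yes" : "no");
        return 0;
    }

  With the latch, flipping the capability between start and cleanup no
  longer skips the multifd teardown, so yank_unregister_instance() would
  not hit the assert(3).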
* On the MigrationStatus point: though we currently have 17-18 migration
  states defined, going by the functions migration_is_running(),
  migration_is_active(), and migration_has_failed(), the migration
  process really has only 3-4 states:

    0 -> stopped (non-active, not started yet)
    1 -> running/active (sending/receiving migration data)
    2 -> paused/active (not running, not sending/receiving migration data)
    3 -> stopped/failed/completed (non-active, not running)

  Other interim states, like initiating/setting up connections or
  cancelling/failing etc., could be tracked differently; a rough sketch
  of what that might look like is below.
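* Again a standalone sketch, not upstream code; all names here are
  hypothetical. It combines the reduced state set above with your
  exiting=true idea, so the interim conditions become internal flags
  rather than first-class, QAPI-visible states:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum {
        MIG_STOPPED,    /* 0: non-active, not started yet */
        MIG_RUNNING,    /* 1: active, sending/receiving migration data */
        MIG_PAUSED,     /* 2: active, but not moving data */
        MIG_FINISHED,   /* 3: non-active: completed, failed or cancelled */
    } MigCoreState;

    typedef struct {
        MigCoreState core;
        bool exiting;   /* cancel/failure requested, cleanup pending */
        bool failed;    /* reported to the user once, at the end */
    } MigStatus;

    static bool mig_is_active(const MigStatus *st)
    {
        return st->core == MIG_RUNNING || st->core == MIG_PAUSED;
    }

    static bool mig_is_running(const MigStatus *st)
    {
        return st->core == MIG_RUNNING && !st->exiting;
    }

    int main(void)
    {
        /* A failing migration is still active (cleanup must run), but
         * no longer running (no new work should start). */
        MigStatus st = { .core = MIG_RUNNING, .exiting = true };
        printf("active=%d running=%d\n", mig_is_active(&st), mig_is_running(&st));
        return 0;
    }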
Thank you.
---
  - Prasad