Yong Huang <yong.hu...@smartx.com> writes:

> On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstra...@web.de> wrote:
>
>> On Fri, 8 Aug 2025 10:36:24 +0800
>> Yong Huang <yong.hu...@smartx.com> wrote:
>>
>> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstra...@web.de> wrote:
>> >
>> > > On Thu, 7 Aug 2025 10:41:17 +0800
>> > > yong.hu...@smartx.com wrote:
>> > >
>> > > > From: Hyman Huang <yong.hu...@smartx.com>
>> > > >
>> > > > When there are network issues like missing TCP ACKs on the send
>> > > > side during the multifd live migration. At the send side, the error
>> > > > "Connection timed out" is thrown out and source QEMU process stop
>> > > > sending data, at the receive side, The IO-channels may be blocked
>> > > > at recvmsg() and thus the main loop gets stuck and fails to respond
>> > > > to QMP commands consequently.
>> > > > ...
>> > >
>> > > Hi Hyman Huang,
>> > >
>> > > Have you tried the 'yank' command to shutdown the sockets? It exactly
>> > > meant to recover from hangs and should solve your issue.
>> > >
>> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
>> >
>> >
>> > Thanks for the comment and advice.
>> >
>> > Let me give more details about the migration state when the issue happens:
>> >
>> > On the source side, libvirt has already aborted the migration job:
>> >
>> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
>> > Job type: Failed
>> > Operation: Outgoing migration
>> >
>> > QMP query-yank shows that there is no migration yank instance:
>> >
>> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 '{"execute":"query-yank"}' --pretty
>> > {
>> >   "return": [
>> >     {
>> >       "type": "chardev",
>> >       "id": "charmonitor"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "charchannel0"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "libvirt-2-virtio-format"
>> >     }
>> >   ],
>> >   "id": "libvirt-5217"
>> > }
>>
>> You are supposed to run it on the destination side, there the migration
>> yank instance should be present if qemu hangs in the migration code.
>>
>> Also, you need to execute it as an out-of-band command to bypass the
>> main loop. Like this:
>>
>> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type": "migration"} ] } }'
>
> In our case, Libvirt's operation about the VM on the destination side has
> been blocked by Migration JOB:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 '{"query-commands"}' --pretty
> error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
>
> Using Libvirt to issue the yank command can not be taken into account.
>
>>
>> I'm not sure if libvirt can do that, maybe you need to add an
>> additional qmp socket and do it outside of libvirt. Note that you need
>> to enable the oob feature during qmp negotiation, like this:
>>
>> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
>
> No, I checked Libvirt's source code and figured out that when the QEMU
> monitor is initialized, Libvirt by default disables the OOB.
>
> Therefore, perhaps we can first enable the OOB and add the yank capability
> to Libvirt then adding the yank logic to the necessary path, in our
> instance, the migration code:
>
> qemuMigrationDstFinish:
>     if (retcode != 0) {
>         /* Check for a possible error on the monitor in case Finish was called
>          * earlier than monitor EOF handler got a chance to process the error
>          */
>         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
>         goto endjob;
>     }
>
>>
>> Regards,
>> Lukas Straub
>>
>> >
>> > The libvirt migration job is stuck as the following backtrace shows; it
>> > shows that migration is waiting for the "Finish" RPC on the destination
>> > side to return.
>> >
>> > ...
>> >
>> > IMHO, the key reason for the issue is that QEMU fails to run the main loop
>> > and fails to respond to QMP, which is not what we usually expected.
>> >
>> > Giving the Libvirt a window of time to issue a QMP and kill the VM is the
>> > ideal solution for this issue; this provides an automatic method.
>> >
>> > I do not dig the yank feature, perhaps it is helpful, but only manually?
>> >
>> > After all, these two options are not exclusive of one another, I think.
>> >

Please work with Lukas to figure out whether yank can be used here. I
think that's the correct approach. If the main loop is blocked, then
some out-of-band cancellation routine is needed. migrate_cancel() could
be it, but at the moment it's not. Yank is the second best thing.

The need for a timeout is usually indicative of a design issue. In this
case, the choice of a coroutine for the incoming side is the obvious
one. Peter will tell you all about it! =)
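
As a rough illustration of the out-of-band path discussed in this thread, the sketch below sends the two QMP messages Lukas quoted (the qmp_capabilities negotiation with "oob" enabled, then "exec-oob": "yank" for the migration instance) over a dedicated QMP UNIX socket on the destination. It is only a sketch under assumptions: the helper names and the socket path are hypothetical, and it presumes the destination QEMU exposes a second QMP monitor outside of libvirt's own, for example one created with something like -qmp unix:/tmp/qmp-oob.sock,server=on,wait=off. It has not been tested against a hung migration.

```python
import json
import socket


def negotiation_msg():
    """Capabilities negotiation with OOB enabled, as quoted in the thread."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_msg(cmd_id="yank0"):
    """Out-of-band yank of the migration instance, as quoted in the thread."""
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def read_reply(reader):
    """Return the next QMP message that is not an asynchronous event."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        msg = json.loads(line)
        if "event" not in msg:
            return msg


def yank_migration(sock_path, timeout=5.0):
    """Connect to a QMP UNIX socket (hypothetical path) and yank migration."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        with sock.makefile("r", encoding="utf-8") as reader:
            # The server greets first; OOB must be advertised to be usable.
            greeting = read_reply(reader)
            if "oob" not in greeting.get("QMP", {}).get("capabilities", []):
                raise RuntimeError("monitor does not offer the oob capability")
            reply = None
            for msg in (negotiation_msg(), yank_migration_msg()):
                sock.sendall(json.dumps(msg).encode("utf-8") + b"\r\n")
                reply = read_reply(reader)
                if "error" in reply:
                    raise RuntimeError(reply["error"])
            return reply
```

Calling yank_migration("/tmp/qmp-oob.sock") on the destination host would then be the manual equivalent of what libvirt would need to do itself once it enables OOB on its monitor. Whether the migration yank instance is actually registered at the point of the hang is exactly what needs checking with query-yank on the destination.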