On Fri, 8 Aug 2025 10:36:24 +0800 Yong Huang <yong.hu...@smartx.com> wrote:
> On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstra...@web.de> wrote:
>
> > On Thu, 7 Aug 2025 10:41:17 +0800
> > yong.hu...@smartx.com wrote:
> >
> > > From: Hyman Huang <yong.hu...@smartx.com>
> > >
> > > When there are network issues like missing TCP ACKs on the send
> > > side during the multifd live migration. At the send side, the error
> > > "Connection timed out" is thrown out and source QEMU process stop
> > > sending data, at the receive side, The IO-channels may be blocked
> > > at recvmsg() and thus the main loop gets stuck and fails to respond
> > > to QMP commands consequently.
> > > ...
> >
> > Hi Hyman Huang,
> >
> > Have you tried the 'yank' command to shutdown the sockets? It exactly
> > meant to recover from hangs and should solve your issue.
> >
> > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
> >
>
> Thanks for the comment and advice.
>
> Let me give more details about the migration state when the issue happens:
>
> On the source side, libvirt has already aborted the migration job:
>
> $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
> Job type: Failed
> Operation: Outgoing migration
>
> QMP query-yank shows that there is no migration yank instance:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> '{"execute":"query-yank"}' --pretty
> {
>   "return": [
>     {
>       "type": "chardev",
>       "id": "charmonitor"
>     },
>     {
>       "type": "chardev",
>       "id": "charchannel0"
>     },
>     {
>       "type": "chardev",
>       "id": "libvirt-2-virtio-format"
>     }
>   ],
>   "id": "libvirt-5217"
> }

You are supposed to run it on the destination side; there, the migration
yank instance should be present if QEMU hangs in the migration code.

Also, you need to execute it as an out-of-band command to bypass the
main loop, like this:

  '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type": "migration"} ] } }'

I'm not sure if libvirt can do that; you may need to add an additional
QMP socket and do it outside of libvirt.

Note that you need to enable the oob feature during QMP negotiation,
like this:

  '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'

Regards,
Lukas Straub

>
> The libvirt migration job is stuck as the following backtrace shows; it
> shows that migration is waiting for the "Finish" RPC on the destination
> side to return.
>
> ...
>
> IMHO, the key reason for the issue is that QEMU fails to run the main loop
> and fails to respond to QMP, which is not what we usually expected.
>
> Giving the Libvirt a window of time to issue a QMP and kill the VM is the
> ideal solution for this issue; this provides an automatic method.
>
> I do not dig the yank feature, perhaps it is helpful, but only manually?
>
> After all, these two options are not exclusive of one another, I think.
>
> >
> >
> > Best regards,
> > Lukas Straub
> >
>
> Thanks,
> Yong
>
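
For illustration, the out-of-band sequence described in the reply (enable
"oob" during QMP negotiation, then send "yank" via "exec-oob") could be
scripted roughly as below. This is only a sketch under assumptions: it
presumes an extra QMP UNIX socket has been added to the destination QEMU
outside of libvirt, the socket path is passed on the command line by the
user, and the helper names (capabilities_msg, yank_migration_msg,
read_reply, yank_migration) are made up for this example. It has not been
tested against a hung migration.

```python
import json
import socket
import sys


def capabilities_msg():
    # Negotiation message that enables out-of-band execution.
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_msg(cmd_id="yank0"):
    # Out-of-band yank of the migration instance, as quoted in the reply.
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def read_reply(lines, cmd_id=None):
    # Skip asynchronous events; return the first "return"/"error" reply,
    # optionally only the one carrying the given id.
    for line in lines:
        msg = json.loads(line)
        if "event" in msg:
            continue
        if "return" in msg or "error" in msg:
            if cmd_id is None or msg.get("id") == cmd_id:
                return msg
    raise ConnectionError("QMP socket closed before a reply arrived")


def yank_migration(sock_path):
    # sock_path is a hypothetical, user-supplied QMP UNIX socket path.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        chan = sock.makefile("rw", encoding="utf-8")

        # The server greets first and lists the capabilities it offers.
        greeting = json.loads(chan.readline())
        if "oob" not in greeting["QMP"]["capabilities"]:
            raise RuntimeError("this monitor does not offer the oob capability")

        chan.write(json.dumps(capabilities_msg()) + "\n")
        chan.flush()
        read_reply(chan)

        chan.write(json.dumps(yank_migration_msg()) + "\n")
        chan.flush()
        return read_reply(chan, cmd_id="yank0")


if __name__ == "__main__":
    print(yank_migration(sys.argv[1]))
```

The message builders are kept separate from the socket handling so the
exact JSON can be checked without a running QEMU.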