On Fri, Aug 8, 2025 at 9:55 PM Fabiano Rosas <faro...@suse.de> wrote:
> Yong Huang <yong.hu...@smartx.com> writes:
>
> > On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstra...@web.de> wrote:
> >
> >> On Fri, 8 Aug 2025 10:36:24 +0800
> >> Yong Huang <yong.hu...@smartx.com> wrote:
> >>
> >> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstra...@web.de> wrote:
> >> >
> >> > > On Thu, 7 Aug 2025 10:41:17 +0800
> >> > > yong.hu...@smartx.com wrote:
> >> > >
> >> > > > From: Hyman Huang <yong.hu...@smartx.com>
> >> > > >
> >> > > > When there are network issues like missing TCP ACKs on the send
> >> > > > side during the multifd live migration. At the send side, the error
> >> > > > "Connection timed out" is thrown out and source QEMU process stop
> >> > > > sending data, at the receive side, The IO-channels may be blocked
> >> > > > at recvmsg() and thus the main loop gets stuck and fails to respond
> >> > > > to QMP commands consequently.
> >> > > > ...
> >> > >
> >> > > Hi Hyman Huang,
> >> > >
> >> > > Have you tried the 'yank' command to shutdown the sockets? It exactly
> >> > > meant to recover from hangs and should solve your issue.
> >> > >
> >> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
> >> >
> >> >
> >> > Thanks for the comment and advice.
> >> >
> >> > Let me give more details about the migration state when the issue
> >> > happens:
> >> >
> >> > On the source side, libvirt has already aborted the migration job:
> >> >
> >> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
> >> > Job type: Failed
> >> > Operation: Outgoing migration
> >> >
> >> > QMP query-yank shows that there is no migration yank instance:
> >> >
> >> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> >> > '{"execute":"query-yank"}' --pretty
> >> > {
> >> >   "return": [
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "charmonitor"
> >> >     },
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "charchannel0"
> >> >     },
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "libvirt-2-virtio-format"
> >> >     }
> >> >   ],
> >> >   "id": "libvirt-5217"
> >> > }
> >>
> >> You are supposed to run it on the destination side, there the migration
> >> yank instance should be present if qemu hangs in the migration code.
> >>
> >> Also, you need to execute it as an out-of-band command to bypass the
> >> main loop. Like this:
> >>
> >> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type": "migration"} ] } }'
> >
> > In our case, Libvirt's operation about the VM on the destination side has
> > been blocked by Migration JOB:
> >
> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> > '{"query-commands"}' --pretty
> > error: Timed out during operation: cannot acquire state change lock (held
> > by monitor=remoteDispatchDomainMigratePrepare3Params)
> > Using Libvirt to issue the yank command can not be taken into account.
> >
> >>
> >> I'm not sure if libvirt can do that, maybe you need to add an
> >> additional qmp socket and do it outside of libvirt.
> >> Note that you need to enable the oob feature during qmp negotiation,
> >> like this:
> >>
> >> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
> >
> >
> > No, I checked Libvirt's source code and figured out that when the QEMU
> > monitor is initialized, Libvirt by default disables the OOB.
> >
> > Therefore, perhaps we can first enable the OOB and add the yank capability
> > to Libvirt then adding the yank logic to the necessary path, in our
> > instance, the migration code:
> >
> > qemuMigrationDstFinish:
> >     if (retcode != 0) {
> >         /* Check for a possible error on the monitor in case Finish was called
> >          * earlier than monitor EOF handler got a chance to process the error
> >          */
> >         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
> >         goto endjob;
> >     }
> >
> >>
> >> Regards,
> >> Lukas Straub
> >>
> >> >
> >> > The libvirt migration job is stuck as the following backtrace shows; it
> >> > shows that migration is waiting for the "Finish" RPC on the destination
> >> > side to return.
> >> >
> >> > ...
> >> >
> >> > IMHO, the key reason for the issue is that QEMU fails to run the main loop
> >> > and fails to respond to QMP, which is not what we usually expected.
> >> >
> >> > Giving the Libvirt a window of time to issue a QMP and kill the VM is the
> >> > ideal solution for this issue; this provides an automatic method.
> >> >
> >> > I do not dig the yank feature, perhaps it is helpful, but only manually?
> >> >
> >> > After all, these two options are not exclusive of one another, I think.
> >> >
> >
> Please work with Lukas to figure out whether yank can be used here. I
> think that's the correct approach. If the main loop is blocked, then
> some out-of-band cancellation routine is needed. migrate_cancel() could
> be it, but at the moment it's not. Yank is the second best thing.

Ok, get it.

>
>
> The need for a timeout is usually indicative of a design issue.
> In this case, the choice of a coroutine for the incoming side is the
> obvious one. Peter will tell you all about it! =)
>

--
Best regards
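
For reference, the out-of-band sequence Lukas describes above (enable the oob capability during QMP negotiation, then send yank via exec-oob) could be driven from a separate QMP socket roughly as follows. This is only a sketch under assumptions: the helper names are made up for illustration, the socket path is hypothetical, and it presumes the destination QEMU was started with an additional QMP monitor on a UNIX socket that libvirt does not own and that the monitor emits one JSON object per line (no pretty-printing).

```python
import json
import socket


def capabilities_msg():
    """Negotiation message that enables the out-of-band (oob) capability."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_msg(cmd_id="yank0"):
    """Out-of-band yank of the migration instance, as quoted in the thread."""
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def read_reply(stream, cmd_id=None):
    """Skip asynchronous events; return the first reply matching cmd_id."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        msg = json.loads(line)
        if "event" in msg:
            continue
        if cmd_id is None or msg.get("id") == cmd_id:
            return msg


def yank_migration(sock_path, timeout=5.0):
    """Connect to a QMP UNIX socket (hypothetical path) and yank migration."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        with sock.makefile("rw", encoding="utf-8") as stream:
            # The server greets first and advertises its capabilities.
            greeting = json.loads(stream.readline())
            if "oob" not in greeting["QMP"]["capabilities"]:
                raise RuntimeError("monitor does not offer the oob capability")
            steps = ((capabilities_msg(), None), (yank_migration_msg(), "yank0"))
            for msg, cmd_id in steps:
                stream.write(json.dumps(msg) + "\n")
                stream.flush()
                reply = read_reply(stream, cmd_id)
                if "error" in reply:
                    raise RuntimeError(reply["error"])
            return reply
```

It would be called as yank_migration("/path/to/extra-qmp.sock"), with the path standing in for wherever the extra monitor socket lives. Whether the migration yank instance is actually registered at that point still has to be checked with query-yank on the destination.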