On Fri, Aug 8, 2025 at 9:55 PM Fabiano Rosas <faro...@suse.de> wrote:

> Yong Huang <yong.hu...@smartx.com> writes:
>
> > On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstra...@web.de> wrote:
> >
> >> On Fri, 8 Aug 2025 10:36:24 +0800
> >> Yong Huang <yong.hu...@smartx.com> wrote:
> >>
> >> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstra...@web.de> wrote:
> >> >
> >> > > On Thu,  7 Aug 2025 10:41:17 +0800
> >> > > yong.hu...@smartx.com wrote:
> >> > >
> >> > > > From: Hyman Huang <yong.hu...@smartx.com>
> >> > > >
> >> > > > When network issues such as missing TCP ACKs occur on the send
> >> > > > side during multifd live migration, the send side reports the
> >> > > > error "Connection timed out" and the source QEMU process stops
> >> > > > sending data. On the receive side, the IO channels may block
> >> > > > in recvmsg(), so the main loop gets stuck and fails to respond
> >> > > > to QMP commands.
> >> > > > ...
> >> > >
> >> > > Hi Hyman Huang,
> >> > >
> >> > > Have you tried the 'yank' command to shut down the sockets? It is
> >> > > meant exactly to recover from hangs and should solve your issue.
> >> > >
> >> > >
> >> https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
> >> >
> >> >
> >> > Thanks for the comment and advice.
> >> >
> >> > Let me give more details about the migration state when the issue happens:
> >> >
> >> > On the source side, libvirt has already aborted the migration job:
> >> >
> >> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
> >> > Job type:         Failed
> >> > Operation:        Outgoing migration
> >> >
> >> > QMP query-yank shows that there is no migration yank instance:
> >> >
> >> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> >> > '{"execute":"query-yank"}' --pretty
> >> > {
> >> >   "return": [
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "charmonitor"
> >> >     },
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "charchannel0"
> >> >     },
> >> >     {
> >> >       "type": "chardev",
> >> >       "id": "libvirt-2-virtio-format"
> >> >     }
> >> >   ],
> >> >   "id": "libvirt-5217"
> >> > }
> >>
> >> You are supposed to run it on the destination side; there the migration
> >> yank instance should be present if qemu hangs in the migration code.
> >>
> >> Also, you need to execute it as an out-of-band command to bypass the
> >> main loop. Like this:
> >>
> >> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type": "migration"} ] } }'
> >
> > In our case, Libvirt's operations on the VM on the destination side are
> > blocked by the migration job:
> >
> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> > '{"query-commands"}' --pretty
> > error: Timed out during operation: cannot acquire state change lock (held
> > by monitor=remoteDispatchDomainMigratePrepare3Params)
> > So issuing the yank command through Libvirt is not an option here.
> >
> >
> >>
> >>
> >> I'm not sure if libvirt can do that, maybe you need to add an
> >> additional qmp socket and do it outside of libvirt. Note that you need
> >> to enable the oob feature during qmp negotiation, like this:
> >>
> >> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
> >
> >
> > No, I checked Libvirt's source code and found that Libvirt disables OOB
> > by default when it initializes the QEMU monitor.
> >
> > Therefore, perhaps we can first enable OOB and add the yank capability
> > to Libvirt, and then add the yank logic to the necessary path; in our
> > case, the migration code:
> >
> > qemuMigrationDstFinish:
> >     if (retcode != 0) {
> >         /* Check for a possible error on the monitor in case Finish was called
> >          * earlier than monitor EOF handler got a chance to process the error
> >          */
> >         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
> >         goto endjob;
> >     }
> >
> >
> >
> >>
> >> Regards,
> >> Lukas Straub
> >>
> >> >
> >> > The libvirt migration job is stuck, as the following backtrace shows:
> >> > migration is waiting for the "Finish" RPC on the destination side to
> >> > return.
> >> >
> >> > ...
> >> >
> >> > IMHO, the key reason for the issue is that QEMU fails to run the main
> >> > loop and fails to respond to QMP, which is not what we usually expect.
> >> >
> >> > Giving Libvirt a window of time to issue a QMP command and kill the VM
> >> > is the ideal solution for this issue; this provides an automatic method.
> >> >
> >> > I have not dug into the yank feature; perhaps it is helpful, but only
> >> > manually?
> >> >
> >> > After all, these two options are not mutually exclusive, I think.
> >> >
>
> Please work with Lukas to figure out whether yank can be used here. I
> think that's the correct approach. If the main loop is blocked, then
> some out-of-band cancellation routine is needed. migrate_cancel() could
> be it, but at the moment it's not. Yank is the second best thing.


OK, got it.
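
For reference, the out-of-band sequence Lukas described (negotiate "oob" on
a separate QMP socket, then send yank via exec-oob) could be sketched roughly
as below. This is an untested sketch, not something Libvirt does today: it
assumes an extra QMP monitor socket outside Libvirt's control, and the helper
names and the socket path argument are hypothetical. Only the two JSON
messages are taken from this thread.

```python
import json
import socket


def qmp_capabilities_oob():
    """Negotiation message that enables the out-of-band capability."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_oob(cmd_id="yank0"):
    """Out-of-band yank of the migration instance (message from this thread)."""
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def send_and_wait(stream, msg):
    """Write one QMP message and return the first reply that is not an event."""
    stream.write(json.dumps(msg) + "\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        reply = json.loads(line)
        if "event" not in reply:  # skip asynchronous events
            return reply


def yank_migration(sock_path):
    """Hypothetical helper: sock_path is an extra QMP UNIX socket (assumption)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        with sock.makefile("rw", encoding="utf-8") as stream:
            # The server greeting lists the capabilities the monitor offers.
            greeting = json.loads(stream.readline())
            if "oob" not in greeting["QMP"]["capabilities"]:
                raise RuntimeError("monitor does not offer the oob capability")
            send_and_wait(stream, qmp_capabilities_oob())
            return send_and_wait(stream, yank_migration_oob())
```

This would be run against the destination QEMU, since that is where the
migration yank instance should be present while the incoming side hangs.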


>
>
> The need for a timeout is usually indicative of a design issue. In this
> case, the choice of a coroutine for the incoming side is the obvious
> one. Peter will tell you all about it! =)
>


-- 
Best regards
