Yong Huang <yong.hu...@smartx.com> writes:

> On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstra...@web.de> wrote:
>
>> On Fri, 8 Aug 2025 10:36:24 +0800
>> Yong Huang <yong.hu...@smartx.com> wrote:
>>
>> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstra...@web.de> wrote:
>> >
>> > > On Thu,  7 Aug 2025 10:41:17 +0800
>> > > yong.hu...@smartx.com wrote:
>> > >
>> > > > From: Hyman Huang <yong.hu...@smartx.com>
>> > > >
>> > > > When network issues such as missing TCP ACKs occur on the send side
>> > > > during multifd live migration, the send side reports the error
>> > > > "Connection timed out" and the source QEMU process stops sending
>> > > > data. On the receive side, the IO channels may block in recvmsg(),
>> > > > so the main loop gets stuck and fails to respond to QMP commands.
>> > > > ...
>> > >
>> > > Hi Hyman Huang,
>> > >
>> > > Have you tried the 'yank' command to shut down the sockets? It is
>> > > meant exactly to recover from hangs and should solve your issue.
>> > >
>> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
>> >
>> > Thanks for the comment and advice.
>> >
>> > Let me give more details about the migration state when the issue happens:
>> >
>> > On the source side, libvirt has already aborted the migration job:
>> >
>> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
>> > Job type:         Failed
>> > Operation:        Outgoing migration
>> >
>> > QMP query-yank shows that there is no migration yank instance:
>> >
>> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>> >     '{"execute":"query-yank"}' --pretty
>> > {
>> >   "return": [
>> >     {
>> >       "type": "chardev",
>> >       "id": "charmonitor"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "charchannel0"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "libvirt-2-virtio-format"
>> >     }
>> >   ],
>> >   "id": "libvirt-5217"
>> > }
>>
>> You are supposed to run it on the destination side; there, the migration
>> yank instance should be present if QEMU hangs in the migration code.
>>
>> Also, you need to execute it as an out-of-band command to bypass the
>> main loop. Like this:
>>
>> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [{"type": "migration"}]}}'
>
> In our case, Libvirt's operations on the VM on the destination side are
> blocked by the migration job:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>     '{"query-commands"}' --pretty
> error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
>
> So issuing the yank command through Libvirt is not an option.
>
>
>>
>>
>> I'm not sure if libvirt can do that, maybe you need to add an
>> additional qmp socket and do it outside of libvirt. Note that you need
>> to enable the oob feature during qmp negotiation, like this:
>>
>> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
>
>
> No, I checked Libvirt's source code and found that Libvirt disables OOB
> by default when it initializes the QEMU monitor.
>
> Therefore, perhaps we can first enable OOB and add the yank capability to
> Libvirt, and then add the yank logic to the necessary path; in our case,
> that is the migration code:
>
> qemuMigrationDstFinish:
>     if (retcode != 0) {
>         /* Check for a possible error on the monitor in case Finish was called
>          * earlier than monitor EOF handler got a chance to process the error
>          */
>         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
>         goto endjob;
>     }
>
>
>
>>
>> Regards,
>> Lukas Straub
>>
>> >
>> > The libvirt migration job is stuck, as the following backtrace shows:
>> > migration is waiting for the "Finish" RPC on the destination side to
>> > return.
>> >
>> > ...
>> >
>> > IMHO, the key reason for the issue is that QEMU fails to run the main
>> > loop and fails to respond to QMP, which is not what we usually expect.
>> >
>> > Giving Libvirt a window of time to issue a QMP command and kill the VM
>> > is the ideal solution for this issue, since it provides an automatic
>> > method.
>> >
>> > I have not dug into the yank feature; perhaps it is helpful, but only
>> > manually?
>> >
>> > After all, these two options are not mutually exclusive, I think.
>> >

Please work with Lukas to figure out whether yank can be used here. I
think that's the correct approach. If the main loop is blocked, then
some out-of-band cancellation routine is needed. migrate_cancel() could
be it, but at the moment it's not. Yank is the second best thing.
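
For illustration only, here is a minimal Python sketch of the sequence Lukas
describes above: enable the oob capability during QMP negotiation, then send
yank as an out-of-band command for the migration instance. It assumes a
second QMP UNIX socket has been added outside of libvirt; the socket path and
all function names are hypothetical, and it assumes the monitor emits one JSON
message per line (i.e. not in pretty mode). Treat it as a sketch, not as the
way libvirt or QEMU tooling does it.

```python
import json
import socket


def qmp_capabilities_oob():
    """Negotiation message that enables out-of-band execution."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration(cmd_id="yank0"):
    """Out-of-band yank for the migration instance (needs an "id")."""
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def has_migration_instance(query_yank_reply):
    """True if a query-yank reply lists a migration yank instance."""
    instances = query_yank_reply.get("return", [])
    return any(inst.get("type") == "migration" for inst in instances)


def yank_over_socket(path, cmd_id="yank0"):
    """Connect to an extra QMP socket (hypothetical path) and yank migration."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        chan = sock.makefile("rw", encoding="utf-8")

        def send(msg):
            chan.write(json.dumps(msg) + "\n")
            chan.flush()

        def recv_reply():
            # Skip asynchronous events; return the first command reply.
            while True:
                line = chan.readline()
                if not line:
                    raise ConnectionError("QMP socket closed")
                msg = json.loads(line)
                if "return" in msg or "error" in msg:
                    return msg

        # The server greeting lists the capabilities the monitor offers.
        greeting = json.loads(chan.readline())
        if "oob" not in greeting["QMP"].get("capabilities", []):
            raise RuntimeError("monitor does not advertise the oob capability")

        send(qmp_capabilities_oob())
        negotiation = recv_reply()
        if "error" in negotiation:
            raise RuntimeError(f"capability negotiation failed: {negotiation}")

        send(yank_migration(cmd_id))
        return recv_reply()
```

Usage would be something like yank_over_socket("/path/to/extra-qmp.sock") on
the destination host. Because the command goes out-of-band on its own
monitor, it does not depend on the main loop that is stuck in recvmsg().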

The need for a timeout is usually indicative of a design issue. In this
case, the choice of a coroutine for the incoming side is the obvious
one. Peter will tell you all about it! =)
