On Thu, Aug 7, 2025 at 11:04 AM <yong.hu...@smartx.com> wrote:

> From: Hyman Huang <yong.hu...@smartx.com>
>
> When network issues such as missing TCP ACKs occur during a multifd
> live migration, the source QEMU process hits a "Connection timed out"
> error and stops sending data. On the receive side, the IO channels
> may block in recvmsg(), so the main loop gets stuck and consequently
> fails to respond to QMP commands.
>
> The following backtrace on the receive side shows the main thread and
> two multifd channel threads:
>
> multifd thread 2:
> Thread 10 (Thread 0x7fd24d5fd700 (LWP 1413634)):
> 0  0x00007fd46066d157 in __libc_recvmsg (fd=46, msg=msg@entry=0x7fd24d5fc530,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
> 1  0x00005556d52ffb1b in qio_channel_socket_readv (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> flags=<optimized out>, errp=0x7fd24d5fc6f8) at ../io/channel-socket.c:513
> 2  0x00005556d530561f in qio_channel_readv_full_all_eof
> (ioc=0x5556d76db290, iov=<optimized out>, niov=<optimized out>, fds=0x0,
> nfds=0x0, errp=errp@entry=0x7fd24d5fc6f8) at ../io/channel.c:142
> 3  0x00005556d53057d9 in qio_channel_readv_full_all (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
> nfds=<optimized out>, errp=0x7fd24d5fc6f8) at ../io/channel.c:210
> 4  0x00005556d4fa4fc9 in multifd_recv_thread 
> (opaque=opaque@entry=0x5556d7affa60)
> at ../migration/multifd.c:1113
> 5  0x00005556d5414826 in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:556
> 6  0x00007fd460662f1b in start_thread (arg=0x7fd24d5fd700) at
> pthread_create.c:486
> 7  0x00007fd46059a1a0 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:98
>
> multifd thread 1:
> Thread 9 (Thread 0x7fd24ddfe700 (LWP 1413633)):
> 0  0x00007fd46066d157 in __libc_recvmsg (fd=44, msg=msg@entry=0x7fd24ddfd530,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
> 1  0x00005556d52ffb1b in qio_channel_socket_readv (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> flags=<optimized out>, errp=0x7fd24ddfd6f8) at ../io/channel-socket.c:513
> 2  0x00005556d530561f in qio_channel_readv_full_all_eof
> (ioc=0x5556d76dc600, iov=<optimized out>, niov=<optimized out>, fds=0x0,
> nfds=0x0, errp=errp@entry=0x7fd24ddfd6f8) at ../io/channel.c:142
> 3  0x00005556d53057d9 in qio_channel_readv_full_all (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
> nfds=<optimized out>, errp=0x7fd24ddfd6f8) at ../io/channel.c:210
> 4  0x00005556d4fa4fc9 in multifd_recv_thread 
> (opaque=opaque@entry=0x5556d7aff990)
> at ../migration/multifd.c:1113
> 5  0x00005556d5414826 in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:556
> 6  0x00007fd460662f1b in start_thread (arg=0x7fd24ddfe700) at
> pthread_create.c:486
> 7  0x00007fd46059a1a0 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:98
>
> main thread:
> Thread 1 (Thread 0x7fd45f1fbe40 (LWP 1413088)):
> 0  0x00007fd46066b616 in futex_abstimed_wait_cancelable (private=0,
> abstime=0x0, clockid=0, expected=0, futex_word=0x5556d7604e80) at
> ../sysdeps/unix/sysv/linux/futex-internal.h:216
> 1  do_futex_wait (sem=sem@entry=0x5556d7604e80, abstime=0x0) at
> sem_waitcommon.c:111
> 2  0x00007fd46066b708 in __new_sem_wait_slow (sem=sem@entry=0x5556d7604e80,
> abstime=0x0) at sem_waitcommon.c:183
> 3  0x00007fd46066b779 in __new_sem_wait (sem=sem@entry=0x5556d7604e80) at
> sem_wait.c:42
> 4  0x00005556d5415524 in qemu_sem_wait (sem=0x5556d7604e80) at
> ../util/qemu-thread-posix.c:358
> 5  0x00005556d4fa5e99 in multifd_recv_sync_main () at
> ../migration/multifd.c:1052
> 6  0x00005556d521ed65 in ram_load_precopy (f=f@entry=0x5556d75dfb90) at
> ../migration/ram.c:4446
> 7  0x00005556d521f1dd in ram_load (f=0x5556d75dfb90, opaque=<optimized
> out>, version_id=4) at ../migration/ram.c:4495
> 8  0x00005556d4faa3e7 in vmstate_load (f=f@entry=0x5556d75dfb90,
> se=se@entry=0x5556d6083070) at ../migration/savevm.c:909
> 9  0x00005556d4fae7a0 in qemu_loadvm_section_part_end (mis=0x5556d6082cc0,
> f=0x5556d75dfb90) at ../migration/savevm.c:2475
> 10 qemu_loadvm_state_main (f=f@entry=0x5556d75dfb90, 
> mis=mis@entry=0x5556d6082cc0)
> at ../migration/savevm.c:2634
> 11 0x00005556d4fafbd5 in qemu_loadvm_state (f=0x5556d75dfb90) at
> ../migration/savevm.c:2706
> 12 0x00005556d4f9ebdb in process_incoming_migration_co (opaque=<optimized
> out>) at ../migration/migration.c:561
> 13 0x00005556d542513b in coroutine_trampoline (i0=<optimized out>,
> i1=<optimized out>) at ../util/coroutine-ucontext.c:186
> 14 0x00007fd4604ef970 in ?? () from target:/lib64/libc.so.6
>
> Once the QEMU process falls into the above state due to such network
> errors, live migration cannot be canceled gracefully and the
> destination VM is left in the "paused" state, since the QEMU process
> on the destination side doesn't respond to the QMP command
> "migrate_cancel".


Actually, in our case, QEMU on the destination side fails to respond to
the QMP command "query-status", not "migrate-cancel".
See the details in my reply to Lukas.

That was my mistake for not double-checking the commit message; I'll
fix it in the next version. :(


>
> To fix this, make the main thread yield to the main loop after it has
> waited too long for the multifd channels to finish receiving data in
> one iteration. A timeout of 10 seconds is sufficient.
>
> Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> ---
>  migration/multifd.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b255778855..aca0aeb341 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1228,6 +1228,16 @@ void multifd_recv_sync_main(void)
>              }
>          }
>          trace_multifd_recv_sync_main_signal(p->id);
> +        do {
> +            if (qemu_sem_timedwait(&multifd_recv_state->sem_sync, 10000) == 0) {
> +                break;
> +            }
> +            if (qemu_in_coroutine()) {
> +                aio_co_schedule(qemu_get_current_aio_context(),
> +                                qemu_coroutine_self());
> +                qemu_coroutine_yield();
> +            }
> +        } while (1);
>          qemu_sem_post(&p->sem_sync);
>      }
>      trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
> --
> 2.27.0
>
Thanks,
Yong

-- 
Best regards
