On Thu, Aug 7, 2025 at 11:04 AM <yong.hu...@smartx.com> wrote:

> From: Hyman Huang <yong.hu...@smartx.com>
>
> When there are network issues like missing TCP ACKs on the send
> side during the multifd live migration. At the send side, the error
> "Connection timed out" is thrown out and source QEMU process stop
> sending data, at the receive side, The IO-channels may be blocked
> at recvmsg() and thus the main loop gets stuck and fails to respond
> to QMP commands consequently.
>
> The QEMU backtrace at the receive side with the main thread and two
> multi-channel threads is displayed as follows:
>
> multifd thread 2:
> Thread 10 (Thread 0x7fd24d5fd700 (LWP 1413634)):
> 0 0x00007fd46066d157 in __libc_recvmsg (fd=46, msg=msg@entry=0x7fd24d5fc530,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
> 1 0x00005556d52ffb1b in qio_channel_socket_readv (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> flags=<optimized out>, errp=0x7fd24d5fc6f8) at ../io/channel-socket.c:513
> 2 0x00005556d530561f in qio_channel_readv_full_all_eof
> (ioc=0x5556d76db290, iov=<optimized out>, niov=<optimized out>, fds=0x0,
> nfds=0x0, errp=errp@entry=0x7fd24d5fc6f8) at ../io/channel.c:142
> 3 0x00005556d53057d9 in qio_channel_readv_full_all (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
> nfds=<optimized out>, errp=0x7fd24d5fc6f8) at ../io/channel.c:210
> 4 0x00005556d4fa4fc9 in multifd_recv_thread
> (opaque=opaque@entry=0x5556d7affa60)
> at ../migration/multifd.c:1113
> 5 0x00005556d5414826 in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:556
> 6 0x00007fd460662f1b in start_thread (arg=0x7fd24d5fd700) at
> pthread_create.c:486
> 7 0x00007fd46059a1a0 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:98
>
> multifd thread 1:
> Thread 9 (Thread 0x7fd24ddfe700 (LWP 1413633)):
> 0 0x00007fd46066d157 in __libc_recvmsg (fd=44, msg=msg@entry=0x7fd24ddfd530,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
> 1 0x00005556d52ffb1b in qio_channel_socket_readv (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> flags=<optimized out>, errp=0x7fd24ddfd6f8) at ../io/channel-socket.c:513
> 2 0x00005556d530561f in qio_channel_readv_full_all_eof
> (ioc=0x5556d76dc600, iov=<optimized out>, niov=<optimized out>, fds=0x0,
> nfds=0x0, errp=errp@entry=0x7fd24ddfd6f8) at ../io/channel.c:142
> 3 0x00005556d53057d9 in qio_channel_readv_full_all (ioc=<optimized out>,
> iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
> nfds=<optimized out>, errp=0x7fd24ddfd6f8) at ../io/channel.c:210
> 4 0x00005556d4fa4fc9 in multifd_recv_thread
> (opaque=opaque@entry=0x5556d7aff990)
> at ../migration/multifd.c:1113
> 5 0x00005556d5414826 in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:556
> 6 0x00007fd460662f1b in start_thread (arg=0x7fd24ddfe700) at
> pthread_create.c:486
> 7 0x00007fd46059a1a0 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:98
>
> main thread:
> Thread 1 (Thread 0x7fd45f1fbe40 (LWP 1413088)):
> 0 0x00007fd46066b616 in futex_abstimed_wait_cancelable (private=0,
> abstime=0x0, clockid=0, expected=0, futex_word=0x5556d7604e80) at
> ../sysdeps/unix/sysv/linux/futex-internal.h:216
> 1 do_futex_wait (sem=sem@entry=0x5556d7604e80, abstime=0x0) at
> sem_waitcommon.c:111
> 2 0x00007fd46066b708 in __new_sem_wait_slow (sem=sem@entry=0x5556d7604e80,
> abstime=0x0) at sem_waitcommon.c:183
> 3 0x00007fd46066b779 in __new_sem_wait (sem=sem@entry=0x5556d7604e80) at
> sem_wait.c:42
> 4 0x00005556d5415524 in qemu_sem_wait (sem=0x5556d7604e80) at
> ../util/qemu-thread-posix.c:358
> 5 0x00005556d4fa5e99 in multifd_recv_sync_main () at
> ../migration/multifd.c:1052
> 6 0x00005556d521ed65 in ram_load_precopy (f=f@entry=0x5556d75dfb90) at
> ../migration/ram.c:4446
> 7 0x00005556d521f1dd in ram_load (f=0x5556d75dfb90, opaque=<optimized
> out>, version_id=4) at ../migration/ram.c:4495
> 8 0x00005556d4faa3e7 in vmstate_load (f=f@entry=0x5556d75dfb90,
> se=se@entry=0x5556d6083070) at ../migration/savevm.c:909
> 9 0x00005556d4fae7a0 in qemu_loadvm_section_part_end (mis=0x5556d6082cc0,
> f=0x5556d75dfb90) at ../migration/savevm.c:2475
> 10 qemu_loadvm_state_main (f=f@entry=0x5556d75dfb90,
> mis=mis@entry=0x5556d6082cc0)
> at ../migration/savevm.c:2634
> 11 0x00005556d4fafbd5 in qemu_loadvm_state (f=0x5556d75dfb90) at
> ../migration/savevm.c:2706
> 12 0x00005556d4f9ebdb in process_incoming_migration_co (opaque=<optimized
> out>) at ../migration/migration.c:561
> 13 0x00005556d542513b in coroutine_trampoline (i0=<optimized out>,
> i1=<optimized out>) at ../util/coroutine-ucontext.c:186
> 14 0x00007fd4604ef970 in ?? () from target:/lib64/libc.so.6
>
> Once the QEMU process falls into the above state in the presence of
> the network errors, live migration cannot be canceled gracefully,
> leaving the destination VM in the "paused" state, since the QEMU
> process on the destination side doesn't respond to the QMP command
> "migrate_cancel".

Actually, in our case the QMP command that QEMU on the destination side
fails to respond to is "query-status", not "migrate_cancel". See my reply
to Lukas for the details. It was my mistake not to check the commit
message; I'll fix it in the next version. :(

>
> To fix that, make the main thread yield to the main loop after waiting
> too long for the multi-channels to finish receiving data during one
> iteration. 10 seconds is a sufficient timeout period to set.
>
> Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> ---
>  migration/multifd.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b255778855..aca0aeb341 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1228,6 +1228,16 @@ void multifd_recv_sync_main(void)
>              }
>          }
>          trace_multifd_recv_sync_main_signal(p->id);
> +        do {
> +            if (qemu_sem_timedwait(&multifd_recv_state->sem_sync, 10000) == 0) {
> +                break;
> +            }
> +            if (qemu_in_coroutine()) {
> +                aio_co_schedule(qemu_get_current_aio_context(),
> +                                qemu_coroutine_self());
> +                qemu_coroutine_yield();
> +            }
> +        } while (1);
>          qemu_sem_post(&p->sem_sync);
>      }
>      trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
> --
> 2.27.0
>
>

Thanks,
Yong

--
Best regards
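
For readers outside the QEMU tree, the timed-wait-then-yield idea in the
quoted hunk can be sketched with plain POSIX semaphores. This is only an
illustration under stated assumptions, not the patch itself: sem_wait_ms(),
wait_with_yield(), yield_fn, and demo_yield() are hypothetical names, and the
callback merely stands in for handing control back to the main loop (the
patch uses aio_co_schedule() and qemu_coroutine_yield() for that).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical callback: stands in for "yield to the main loop". */
typedef void (*yield_fn)(void *opaque);

/* Wait on sem for at most timeout_ms.
 * Returns 0 if acquired, 1 on timeout, -1 on any other error. */
static int sem_wait_ms(sem_t *sem, long timeout_ms)
{
    struct timespec ts;
    int rc;

    /* sem_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        return -1;
    }
    ts.tv_sec += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec += 1;
        ts.tv_nsec -= 1000000000L;
    }

    /* Retry if a signal interrupts the wait. */
    do {
        rc = sem_timedwait(sem, &ts);
    } while (rc != 0 && errno == EINTR);

    if (rc == 0) {
        return 0;
    }
    return errno == ETIMEDOUT ? 1 : -1;
}

/* Keep waiting for sem, but call yield() after every timeout so the
 * caller's event loop gets a chance to run instead of blocking forever.
 * Returns the number of yields before the semaphore was acquired,
 * or -1 on error. */
static int wait_with_yield(sem_t *sem, long timeout_ms,
                           yield_fn yield, void *opaque)
{
    int yields = 0;

    for (;;) {
        int rc = sem_wait_ms(sem, timeout_ms);
        if (rc == 0) {
            return yields;
        }
        if (rc < 0) {
            return -1;
        }
        yield(opaque);
        yields++;
    }
}

/* Demo state: the "main loop" posts the semaphore on its third turn. */
struct demo_state {
    sem_t sem;
    int calls;
};

static void demo_yield(void *opaque)
{
    struct demo_state *s = opaque;

    if (++s->calls == 3) {
        sem_post(&s->sem);
    }
}
```

To try it, initialise a struct demo_state with sem_init(&st.sem, 0, 0) and
calls = 0, then call wait_with_yield(&st.sem, 10, demo_yield, &st). The short
10 ms timeout is only for the demo; the patch uses 10000 ms.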