Prasad Pandit <ppan...@redhat.com> writes:

> On Tue, 1 Apr 2025 at 02:24, Fabiano Rosas <faro...@suse.de> wrote:
>> The postcopy/multifd/plain test is still hanging from time to time. I
>> see a vmstate load function trying to access guest memory and the
>> postcopy-listen thread already finished, waiting for that
>> qemu_loadvm_state() (frame #18) to return and set the
>> main_thread_load_event.
>>
>> Thread 1 (Thread 0x7fbc4849df80 (LWP 7487) "qemu-system-x86"):
>> #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:274
>> #1  0x0000560b135103aa in flatview_read_continue_step (attrs=..., buf=0x560b168a5930 "U\252\022\006\016\a1\300\271", len=9216, mr_addr=831488, l=0x7fbc465ff980, mr=0x560b166c5070) at ../system/physmem.c:3056
>> #2  0x0000560b1351042e in flatview_read_continue (fv=0x560b16c606a0, addr=831488, attrs=..., ptr=0x560b168a5930, len=9216, mr_addr=831488, l=9216, mr=0x560b166c5070) at ../system/physmem.c:3073
>> #3  0x0000560b13510533 in flatview_read (fv=0x560b16c606a0, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3103
>> #4  0x0000560b135105be in address_space_read_full (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3116
>> #5  0x0000560b135106e7 in address_space_rw (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3144
>> #6  0x0000560b13510848 in cpu_physical_memory_rw (addr=831488, buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3170
>> #7  0x0000560b1338f5a5 in cpu_physical_memory_read (addr=831488, buf=0x560b168a5930, len=9216) at qemu/include/exec/cpu-common.h:148
>> #8  0x0000560b1339063c in patch_hypercalls (s=0x560b168840c0) at ../hw/i386/vapic.c:547
>> #9  0x0000560b1339096d in vapic_prepare (s=0x560b168840c0) at ../hw/i386/vapic.c:629
>> #10 0x0000560b13390e8b in vapic_post_load (opaque=0x560b168840c0, version_id=1) at ../hw/i386/vapic.c:789
>> #11 0x0000560b135b4924 in vmstate_load_state (f=0x560b16c53400, vmsd=0x560b147c6cc0 <vmstate_vapic>, opaque=0x560b168840c0, version_id=1) at ../migration/vmstate.c:234
>> #12 0x0000560b132a15b8 in vmstate_load (f=0x560b16c53400, se=0x560b16893390) at ../migration/savevm.c:972
>> #13 0x0000560b132a4f28 in qemu_loadvm_section_start_full (f=0x560b16c53400, type=4 '\004') at ../migration/savevm.c:2746
>> #14 0x0000560b132a5ae8 in qemu_loadvm_state_main (f=0x560b16c53400, mis=0x560b16877f20) at ../migration/savevm.c:3058
>> #15 0x0000560b132a45d0 in loadvm_handle_cmd_packaged (mis=0x560b16877f20) at ../migration/savevm.c:2451
>> #16 0x0000560b132a4b36 in loadvm_process_command (f=0x560b168c3b60) at ../migration/savevm.c:2614
>> #17 0x0000560b132a5b96 in qemu_loadvm_state_main (f=0x560b168c3b60, mis=0x560b16877f20) at ../migration/savevm.c:3073
>> #18 0x0000560b132a5db7 in qemu_loadvm_state (f=0x560b168c3b60) at ../migration/savevm.c:3150
>> #19 0x0000560b13286271 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:892
>> #20 0x0000560b137cb6d4 in coroutine_trampoline (i0=377836416, i1=22027) at ../util/coroutine-ucontext.c:175
>> #21 0x00007fbc4786a79e in ??? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:103
>>
>> Thread 10 (Thread 0x7fffce7fc700 (LWP 11778) "mig/dst/listen"):
>> #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> #1  0x000055555614e33f in qemu_futex_wait (f=0x5555576f6fc0, val=4294967295) at qemu/include/qemu/futex.h:29
>> #2  0x000055555614e505 in qemu_event_wait (ev=0x5555576f6fc0) at ../util/qemu-thread-posix.c:464
>> #3  0x0000555555c44eb1 in postcopy_ram_listen_thread (opaque=0x5555576f6f20) at ../migration/savevm.c:2135
>> #4  0x000055555614e6b8 in qemu_thread_start (args=0x5555582c8480) at ../util/qemu-thread-posix.c:541
>> #5  0x00007ffff72626ea in start_thread (arg=0x7fffce7fc700) at pthread_create.c:477
>> #6  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Thread 9 (Thread 0x7fffceffd700 (LWP 11777) "mig/dst/fault"):
>> #0  0x00007ffff5314a89 in __GI___poll (fds=0x7fffc0000b60, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>> #1  0x0000555555c3be3f in postcopy_ram_fault_thread (opaque=0x5555576f6f20) at ../migration/postcopy-ram.c:999
>> #2  0x000055555614e6b8 in qemu_thread_start (args=0x555557735be0) at ../util/qemu-thread-posix.c:541
>> #3  0x00007ffff72626ea in start_thread (arg=0x7fffceffd700) at pthread_create.c:477
>> #4  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Breaking with gdb and stepping through the memcpy code generates a
>> request for a page that's seemingly already in the receivedmap:
>>
>> (gdb) x/i $pc
>> => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>: rep movsb %ds:(%rsi),%es:(%rdi)
>> (gdb) p/x $rsi
>> $1 = 0x7fffd68cc000
>> (gdb) si
>> postcopy_ram_fault_thread_request Request for HVA=0x7fffd68cc000 rb=pc.ram offset=0xcc000 pid=11754
>> // these are my printfs:
>> postcopy_request_page:
>> migrate_send_rp_req_pages:
>> migrate_send_rp_req_pages: mutex
>> migrate_send_rp_req_pages: received
>>
>> // gdb hangs here, it looks like the page wasn't populated?
>>
>> I've had my share of postcopy for the day. Hopefully you'll be able to
>> figure out what the issue is.
>>
>> - reproducer (2nd iter already hangs for me):
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>>
>> - reproducer with traces and gdb:
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_TRACE="multifd_* -trace source_* -trace postcopy_* -trace savevm_* \
>> -trace loadvm_*" QTEST_QEMU_BINARY_DST='gdb --ex "handle SIGUSR1 \
>> noprint" --ex "run" --args ./qemu-system-x86_64' \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>
> * Thank you for the reproducer and traces. I'll try to check more and
> see if I'm able to reproduce it on my side.
>
Thanks. I cannot merge this series until that issue is resolved. If it
reproduces on my machine, there's a high chance it will break CI at some
point, and then it'll be a nightmare to debug. This has happened many
times before with multifd. My reading of the destination-side check we
seem to be hitting is sketched at the bottom of this mail.

> Thank you.
> ---
> - Prasad
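
A toy model of the check in question follows. This is a paraphrase of
the receivedmap test done by migrate_send_rp_req_pages() (the function
in the printfs above), not the actual QEMU code, and the offsets are
made up to match the trace. The point it illustrates: once the received
bitmap claims a page has arrived, no page request goes back to the
source, so a page that was marked received but never actually placed
into the userfaultfd-registered area leaves the faulting thread waiting
forever.

/*
 * Toy model, not QEMU code: paraphrases the receivedmap check in
 * migrate_send_rp_req_pages().  Build with: gcc -o recvmap-demo recvmap-demo.c
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define NR_PAGES  256

static uint8_t receivedmap[NR_PAGES];   /* 1 = page marked as received */

/* Roughly ramblock_recv_bitmap_test(): was this offset marked received? */
static bool recv_bitmap_test(uint64_t offset)
{
    return receivedmap[offset / PAGE_SIZE];
}

/* Roughly the decision in migrate_send_rp_req_pages(), heavily simplified. */
static bool request_page(uint64_t offset)
{
    if (recv_bitmap_test(offset)) {
        /* Believed to be there already: no request is sent to the source. */
        printf("offset 0x%" PRIx64 ": already in receivedmap, not requesting\n",
               offset);
        return false;
    }
    printf("offset 0x%" PRIx64 ": sending page request to source\n", offset);
    return true;
}

int main(void)
{
    /*
     * Pretend offset 0xcc000 (the page from the trace) was marked received
     * earlier, even though the faulting thread still finds it missing.
     */
    receivedmap[0xcc000 / PAGE_SIZE] = 1;

    request_page(0xcc000);   /* dropped: the faulting thread would hang */
    request_page(0xcd000);   /* normal path: the source would send the page */
    return 0;
}

If some step in the multifd/postcopy handover can set that bit without
the page really being copied into place, it would be consistent with
the trace and with gdb never coming back from the rep movsb.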