On Tue, Apr 29, 2025 at 08:50:19PM +0530, Prasad Pandit wrote:
> On Tue, 29 Apr 2025 at 19:18, Peter Xu <pet...@redhat.com> wrote:
> > Please don't rush to send. Again, let's verify the issue first before
> > resending anything.
> >
> > If you could reproduce it, it would be perfect, then we can already
> > verify it. Otherwise we may need help from Fabiano. Let's not send
> > anything if you're not yet sure whether it works.. It can confuse
> > people thinking problem solved, but maybe not yet.
>
> * No, the migration hang issue is not reproducing on my side. Earlier
> in this thread, Fabiano said you'll be better able to confirm the
> issue. (so its possible fix as well I guess)
>
> * You don't have access to the set-up that he uses for running tests
> and merging patches? Would it be possible for you to run the same
> tests? (just checking, I don't know how co-maintainers work to
> test/merge patches)
No, I don't.

> * If we don't send the patch, how will Fabiano test it? Should we wait
> for Fabiano to come back and then make this same patch in his set-up
> and test/verify it?

I thought you'd already provided a diff. That would be good enough for
verification. If you really want, you can repost, but please mention
explicitly that you haven't verified the issue, so the patchset still
needs to be verified.

Fabiano should come back early May.

If you want, you can try to look into how to reproduce it by checking
why it triggered in the vapic path:

https://lore.kernel.org/all/87plhwgbu6....@suse.de/#t

Thread 1 (Thread 0x7fbc4849df80 (LWP 7487) "qemu-system-x86"):
#0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:274
#1  0x0000560b135103aa in flatview_read_continue_step (attrs=..., buf=0x560b168a5930 "U\252\022\006\016\a1\300\271", len=9216, mr_addr=831488, l=0x7fbc465ff980, mr=0x560b166c5070) at ../system/physmem.c:3056
#2  0x0000560b1351042e in flatview_read_continue (fv=0x560b16c606a0, addr=831488, attrs=..., ptr=0x560b168a5930, len=9216, mr_addr=831488, l=9216, mr=0x560b166c5070) at ../system/physmem.c:3073
#3  0x0000560b13510533 in flatview_read (fv=0x560b16c606a0, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3103
#4  0x0000560b135105be in address_space_read_full (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3116
#5  0x0000560b135106e7 in address_space_rw (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3144
#6  0x0000560b13510848 in cpu_physical_memory_rw (addr=831488, buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3170
#7  0x0000560b1338f5a5 in cpu_physical_memory_read (addr=831488, buf=0x560b168a5930, len=9216) at qemu/include/exec/cpu-common.h:148
#8  0x0000560b1339063c in patch_hypercalls (s=0x560b168840c0) at ../hw/i386/vapic.c:547
#9  0x0000560b1339096d in vapic_prepare (s=0x560b168840c0) at ../hw/i386/vapic.c:629
#10 0x0000560b13390e8b in vapic_post_load (opaque=0x560b168840c0, version_id=1) at ../hw/i386/vapic.c:789
#11 0x0000560b135b4924 in vmstate_load_state (f=0x560b16c53400, vmsd=0x560b147c6cc0 <vmstate_vapic>, opaque=0x560b168840c0, version_id=1) at ../migration/vmstate.c:234
#12 0x0000560b132a15b8 in vmstate_load (f=0x560b16c53400, se=0x560b16893390) at ../migration/savevm.c:972
#13 0x0000560b132a4f28 in qemu_loadvm_section_start_full (f=0x560b16c53400, type=4 '\004') at ../migration/savevm.c:2746
#14 0x0000560b132a5ae8 in qemu_loadvm_state_main (f=0x560b16c53400, mis=0x560b16877f20) at ../migration/savevm.c:3058
#15 0x0000560b132a45d0 in loadvm_handle_cmd_packaged (mis=0x560b16877f20) at ../migration/savevm.c:2451
#16 0x0000560b132a4b36 in loadvm_process_command (f=0x560b168c3b60) at ../migration/savevm.c:2614
#17 0x0000560b132a5b96 in qemu_loadvm_state_main (f=0x560b168c3b60, mis=0x560b16877f20) at ../migration/savevm.c:3073
#18 0x0000560b132a5db7 in qemu_loadvm_state (f=0x560b168c3b60) at ../migration/savevm.c:3150
#19 0x0000560b13286271 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:892
#20 0x0000560b137cb6d4 in coroutine_trampoline (i0=377836416, i1=22027) at ../util/coroutine-ucontext.c:175
#21 0x00007fbc4786a79e in ??? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:103

So _if_ the theory is correct, vapic's patch_hypercalls() might be
reading a zero page (GPA 831488, len=9216, which IIUC covers three
pages). Maybe you can check when it'll be one zero page and when it
won't be, then figure out how to make it always a zero page and hence
reliably trigger a hang in post_load.

You could also try to write a program in the guest that zeroes most of
its pages first, then trigger migration (so zero pages get sent during
multifd precopy) and start postcopy; you should then be able to observe
a vCPU hang, at least before postcopy completes (a rough, untested
sketch of such a guest program is appended at the end of this mail).

However, I don't think it'll hang forever: once migration completes,
UFFDIO_UNREGISTER will remove the userfaultfd tracking and kick all the
hanging threads out, so the fault gets resolved right at the completion
of postcopy. So it won't really hang forever like what Fabiano reported
here. Meanwhile we'll always want to verify the original reproducer..
even if you could hang a vcpu thread temporarily.

Thanks,

--
Peter Xu
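Appendix: a completely untested sketch of the kind of guest program
meant above, just to illustrate the idea. It assumes a Linux guest; the
program name, the 1 GiB mapping size, the hard-coded 4096-byte page size
and the periodic read loop are arbitrary placeholders to adjust for the
actual guest.

/*
 * zero_pages.c - untested sketch: populate a large, all-zero anonymous
 * mapping in the guest so precopy sends it as zero pages, then keep
 * re-reading it so the vCPU faults on those pages during postcopy.
 *
 * Build in the guest with e.g.:  gcc -O2 -o zero_pages zero_pages.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;   /* 1 GiB; placeholder, tune to guest RAM */
    unsigned char *buf;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Write zeroes so every page is actually populated in guest RAM
     * while its content stays all-zero. */
    memset(buf, 0, len);

    /* Best effort: keep the pages resident so the guest kernel does
     * not reclaim them while migration runs. */
    if (mlock(buf, len) != 0) {
        perror("mlock (non-fatal)");
    }

    printf("populated %zu zero bytes; start migration now\n", len);

    /* Read one byte per page forever; during postcopy these accesses
     * should fault on the destination if the zero pages were not
     * marked as received. */
    for (;;) {
        volatile unsigned char sum = 0;
        for (size_t off = 0; off < len; off += 4096) {
            sum += buf[off];   /* read-only, keeps the pages zero */
        }
        (void)sum;
        sleep(1);
    }
    return 0;
}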