Prasad Pandit <ppan...@redhat.com> writes: > Hello Fabiano, > > On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <faro...@suse.de> wrote: >> Prasad Pandit <ppan...@redhat.com> writes: >> > * Thank you for the reproducer and traces. I'll try to check more and >> > see if I'm able to reproduce it on my side. >> >> Thanks. I cannot merge this series until that issue is resolved. If it >> reproduces on my machine there's a high chance that it will break CI at >> some point and then it'll be a nightmare to debug. This has happened >> many times before with multifd. > > === > qemu/build)$ for i in $(seq 1 9999); do echo "$i ===="; > QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test > --full -r '/x86_64/migration/postcopy/multifd/plain' || break; done | > tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) }; > /slow test/ { printf("%s\n", $_); }' > > Host-1] > ... > 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.42 secs > 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.56 secs > 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.44 secs > 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.43 secs > 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.53 secs > 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.46 secs > 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.49 secs > 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.41 secs > 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.42 secs > 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.58 secs > 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > -------- > Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs > > > Host-2] > ... > 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.69 secs > 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.41 secs > 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.54 secs > 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.44 secs > 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.44 secs > 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.37 secs > 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.45 secs > 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.53 secs > 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.47 secs > 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > -------- > Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs > > > Host-3] > ... > 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.50 secs > 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.55 secs > 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.54 secs > 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.49 secs > 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.49 secs > 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.52 secs > 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.52 secs > 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.54 secs > 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.51 secs > 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.50 secs > 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.53 secs > 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.50 secs > 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.53 secs > 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.49 secs > 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.48 secs > 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.54 secs > 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.44 secs > 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain > executed in 1.54 secs > -------- > Iter: 9999, low: 1.31, high: 2.49, avg: 1.48 > === > > * I tried to reproduce the hang issue with and without -traces across > 3 different machines but am unable to reproduce it on my side. > > * Going through the source and the back trace you provided, you said > gdb hangs in the postcopy_ram_fault_thread() function at poll() > function to wait for a missing page. > - But by this time, postcopy_ram_listen thread is already preparing > to cleanup and exit > - That means postcopy migration is finished/ending > - ie. postcopy migration is ending without (or before) migrating > all the RAM pages from the source side? >
That's what it looks like. It could be some error condition that is not being propagated properly. The thread hits an error and exits without informing the rest of migration. > In postcopy mode: > * Is there a way to log the pages (#numers) that are sent from the > source side? > * And log the pages (#numbers) that are received on the receive side? > > * That way we might be able to check/confirm the pages which were not > received or not processed properly. Some combination of the postcopy traces should give you that. Sorry, Peter Xu really is the expert on postcopy, I just tag along. > > * Can we connect the faulting/missing (HVA=0x7fffd68cc000) > address/page in postcopy_ram_fault_thread() with the memcpy that the > main thread seems to be loading via vapic_post_load()? ie. the main > thread and pocyopy_ram_fault_thread() above could be doing unrelated > things. > The snippet I posted shows that it's the same page: (gdb) x/i $pc => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>: rep movsb %ds:(%rsi),%es:(%rdi) (gdb) p/x $rsi $1 = 0x7fffd68cc000 > * Other than this, I've revised the patch-set as suggested. How do we > proceed further? Send your next version and I'll set some time aside to debug this. heads-up: I'll be off from 2025/04/18 until 2025/05/05. Peter should be already back in the meantime. > > Thank you. > --- > - Prasad