Prasad Pandit <ppan...@redhat.com> writes:

> Hello Fabiano,
>
> On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <faro...@suse.de> wrote:
>> Prasad Pandit <ppan...@redhat.com> writes:
>> > * Thank you for the reproducer and traces. I'll try to check more and
>> > see if I'm able to reproduce it on my side.
>>
>> Thanks. I cannot merge this series until that issue is resolved. If it
>> reproduces on my machine there's a high chance that it will break CI at
>> some point and then it'll be a nightmare to debug. This has happened
>> many times before with multifd.
>
> ===
> qemu/build)$ for i in $(seq 1 9999); do echo "$i ====";
> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test
> --full -r '/x86_64/migration/postcopy/multifd/plain' || break; done |
> tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) };
> /slow test/ { printf("%s\n", $_); }'
>
> Host-1]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.56 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.43 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.46 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.58 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> --------
> Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs
>
>
> Host-2]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.69 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.37 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> --------
> Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs
>
>
> Host-3]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.55 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> --------
> Iter: 9999, low: 1.31, high: 2.49, avg: 1.48
> ===
>
> * I tried to reproduce the hang issue with and without -traces across
> 3 different machines but am unable to reproduce it on my side.
>
> * Going through the source and the back trace you provided, you said
> gdb hangs in the postcopy_ram_fault_thread() function at poll()
> function to wait for a missing page.
>    - But by this time, postcopy_ram_listen thread is already preparing
> to cleanup and exit
>        - That means postcopy migration is finished/ending
>    - ie. postcopy migration is ending without (or before) migrating
> all the RAM pages from the source side?
>

That's what it looks like. It could be some error condition that is not
being propagated properly. The thread hits an error and exits without
informing the rest of migration.

> In postcopy mode:
>     * Is there a way to log the pages (#numers) that are sent from the
> source side?
>     * And log the pages (#numbers) that are received on the receive side?
>
> * That way we might be able to check/confirm the pages which were not
> received or not processed properly.

Some combination of the postcopy traces should give you that. Sorry,
Peter Xu really is the expert on postcopy, I just tag along.

>
> * Can we connect the faulting/missing (HVA=0x7fffd68cc000)
> address/page in postcopy_ram_fault_thread() with the memcpy that the
> main thread seems to be loading via vapic_post_load()? ie. the main
> thread and pocyopy_ram_fault_thread() above could be doing unrelated
> things.
>

The snippet I posted shows that it's the same page:

(gdb) x/i $pc
=> 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>:    rep movsb 
%ds:(%rsi),%es:(%rdi)
(gdb) p/x $rsi
$1 = 0x7fffd68cc000

> * Other than this, I've revised the patch-set as suggested. How do we
> proceed further?

Send your next version and I'll set some time aside to debug this.

heads-up: I'll be off from 2025/04/18 until 2025/05/05. Peter should be
already back in the meantime.

>
> Thank you.
> ---
>   - Prasad

Reply via email to