Hi,

On Fri, 11 Apr 2025 at 01:48, Fabiano Rosas <faro...@suse.de> wrote:
> That's what it looks like. It could be some error condition that is not
> being propagated properly. The thread hits an error and exits without
> informing the rest of migration.

* The gdb(1) hang in postcopy_ram_fault_thread() is not conclusive. I
tried to set the following breakpoints:

    (gdb) break postcopy-ram.c:998   - poll_result = poll(pfd, pfd_len, -1 /* Wait forever */);
    (gdb) break postcopy-ram.c:1057  - rb = qemu_ram_block_from_host(...);

  gdb(1) hangs at both of them, so there might be another reason for
the hang. Live migration also stalls along with it.
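
  One thing that might avoid the stall (just a sketch, not verified on
this setup): gdb's non-stop mode plus a dprintf instead of a plain
breakpoint, so the other migration threads keep running while the
fault thread is observed. The line number is the one from above, and
the 'msg' local is assumed to be in scope there:

    (gdb) set pagination off
    (gdb) set non-stop on          # must be set before attaching
    (gdb) attach <dest-qemu-pid>
    (gdb) dprintf postcopy-ram.c:1057, "fault hva=0x%lx\n", (unsigned long)msg.arg.pagefault.address
    (gdb) continue -a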

> Some combination of the postcopy traces should give you that. Sorry,
> Peter Xu really is the expert on postcopy, I just tag along.

* I see. Maybe it could be logged via a --migration-debug=<level> option.
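
  For what it's worth, the events that already exist in trace-events
might be enough without a new option. Enabling them on the destination
would look roughly like this (the patterns are a guess and should be
checked against the trace-events files in the tree):

    qemu-system-x86_64 ... -trace 'postcopy_*' -trace 'migration_*'

  With the log trace backend the output goes to stderr, which is
presumably where the postcopy_ram_fault_thread_request line below came
from.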

> The snippet I posted shows that it's the same page:
>
> (gdb) x/i $pc
> => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>:    rep movsb 
> %ds:(%rsi),%es:(%rdi)
> (gdb) p/x $rsi
> $1 = 0x7fffd68cc000
>
===
>> Thread 1 (Thread 0x7fbc4849df80 (LWP 7487) "qemu-system-x86"):
...
>> Thread 10 (Thread 0x7fffce7fc700 (LWP 11778) "mig/dst/listen"):
...
>> Thread 9 (Thread 0x7fffceffd700 (LWP 11777) "mig/dst/fault"):
#0  0x00007ffff5314a89 in __GI___poll (fds=0x7fffc0000b60, nfds=2,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
...
postcopy_ram_fault_thread_request Request for HVA=0x7fffd68cc000
rb=pc.ram offset=0xcc000 pid=11754
===

* Looking at the above data, it seems the missing-page fault occurred
in thread 11754; it may not be the memcpy(3) in thread 1
(pid/tid=7487) that triggered the fault.
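
  If/when the breakpoint at postcopy-ram.c:1057 does fire, the uffd
message itself should say which thread faulted and at which address.
A sketch (it assumes the message local in postcopy_ram_fault_thread()
is named 'msg', and that UFFD_FEATURE_THREAD_ID is in effect, which
the non-zero pid in the trace above suggests):

    (gdb) p/x msg.arg.pagefault.address    # faulting HVA
    (gdb) p msg.arg.pagefault.feat.ptid    # tid of the thread that faulted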

* Secondly, if the 'mig/dst/fault' thread is waiting in the poll(2)
call, then no fault notification has arrived yet on the
mis->userfault_fd or mis->userfault_event_fd descriptors. So the
"Request for HVA=0x7fffd..." trace via
postcopy_ram_fault_thread_request() could be a request that has
already been served.
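
  The shape of that loop, reduced to the essentials, is sketched below
(a simplified illustration, not QEMU's actual code; names like
fault_loop/wake_event_fd are made up, and the UFFDIO_API handshake and
error handling are omitted):

    #include <poll.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    /* Minimal userfaultfd fault-handling loop, for illustration only. */
    static void fault_loop(int userfault_fd, int wake_event_fd)
    {
        struct pollfd pfd[2] = {
            { .fd = userfault_fd,  .events = POLLIN },
            { .fd = wake_event_fd, .events = POLLIN },
        };

        for (;;) {
            /* Blocks here until the kernel queues a new fault message
             * or the event fd is poked; a thread parked in poll() has
             * nothing pending to read. */
            if (poll(pfd, 2, -1) < 0) {
                break;
            }

            if (pfd[0].revents & POLLIN) {
                struct uffd_msg msg;

                /* read() consumes the message, so a fault traced
                 * earlier is no longer visible at this point. */
                if (read(userfault_fd, &msg, sizeof(msg)) !=
                    (ssize_t)sizeof(msg)) {
                    break;
                }
                if (msg.event == UFFD_EVENT_PAGEFAULT) {
                    /* msg.arg.pagefault.address is the faulting HVA;
                     * the real thread maps it to a RAMBlock/offset and
                     * asks the source for that page. */
                }
            }
        }
    }

  So by the time the backtrace shows the thread back in poll(2), the
request behind the "Request for HVA=..." trace has already been read
and handed off; whether the page was actually placed afterwards is a
separate question.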


> Send your next version and I'll set some time aside to debug this.
>
> heads-up: I'll be off from 2025/04/18 until 2025/05/05. Peter should be
> already back in the meantime.

* Okay, I'll send the next version.

Thank you.
---
  - Prasad

