On Tue, 10 Mar 2026 23:12:25 +0530
Arun Menon <[email protected]> wrote:

> Hi Lukas,
> 
> On Tue, Mar 10, 2026 at 04:29:54PM +0100, Lukas Straub wrote:
> > On Tue, 10 Mar 2026 20:17:57 +0530
> > Arun Menon <[email protected]> wrote:
> >   
> > > Hi Lukas,
> > > 
> > > On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:  
> > > > Add a COLO migration test for COLO migration and failover.
> > > > 
> > > > Reviewed-by: Fabiano Rosas <[email protected]>
> > > > Tested-by: Fabiano Rosas <[email protected]>
> > > > Reviewed-by: Peter Xu <[email protected]>
> > > > Signed-off-by: Lukas Straub <[email protected]>
> > > > [...]
> > > >    
> > > 
> > > I was running the qtests locally, and I encountered a timeout error.
> > > 
> > > Command run: mkdir -p build ; cd build ; make check-qtest-x86_64;
> > > 
> > > Following is the output:
> > > ======
> > > [...]
> > > 
> > > Summary of Failures:
> > > 67/67 qtest+qtest-x86_64 - qemu:qtest-x86_64/migration-test         
> > > TIMEOUT        480.05s   killed by signal 15 SIGTERM
> > > Ok:                64
> > > Fail:              0
> > > Skipped:           2
> > > Timeout:           1
> > > ======
> > > 
> > > It seems that the test runner is stuck waiting for some input.
> > > Following is the stack trace
> > > [...]
> > > ======
> > > gstack 128276
> > > Thread 2 (Thread 0x7fdd090716c0 (LWP 128279) "call_rcu"):
> > > #0  0x00007fdd0921434d in syscall () from /lib64/libc.so.6
> > > #1  0x0000557fd604563a in qemu_futex_wait (f=0x557fd60a0190 
> > > <rcu_call_ready_event>, val=4294967295) at 
> > > /home/arun/workdir/new/devel/upstream/qemu-priv/include/qemu/futex.h:47
> > > #2  0x0000557fd604584e in qemu_event_wait (ev=0x557fd60a0190 
> > > <rcu_call_ready_event>) at ../util/event.c:162
> > > #3  0x0000557fd6045fde in call_rcu_thread (opaque=0x0) at 
> > > ../util/rcu.c:304
> > > #4  0x0000557fd600e8fb in qemu_thread_start (args=0x557fd6beec70) at 
> > > ../util/qemu-thread-posix.c:414
> > > #5  0x00007fdd09193464 in start_thread () from /lib64/libc.so.6
> > > #6  0x00007fdd092165ac in __clone3 () from /lib64/libc.so.6
> > > 
> > > Thread 1 (Thread 0x7fdd09073240 (LWP 128276) "migration-test"):
> > > #0  0x00007fdd0919b982 in __syscall_cancel_arch () from /lib64/libc.so.6
> > > #1  0x00007fdd0918fc3c in __internal_syscall_cancel () from 
> > > /lib64/libc.so.6
> > > #2  0x00007fdd091dfb62 in clock_nanosleep@GLIBC_2.2.5 () from 
> > > /lib64/libc.so.6
> > > #3  0x00007fdd091ebb37 in nanosleep () from /lib64/libc.so.6
> > > #4  0x00007fdd0921613a in usleep () from /lib64/libc.so.6
> > > #5  0x0000557fd5fd99cd in wait_for_serial (side=0x557fd6065f08 
> > > "dest_serial") at ../tests/qtest/migration/framework.c:82
> > > #6  0x0000557fd5fe5865 in test_colo_common (args=0x557fd6bfdf50, 
> > > failover_during_checkpoint=false, primary_failover=true) at 
> > > ../tests/qtest/migration/colo-tests.c:66  
> > 
> > 60    migrate_qmp(from, to, args->connect_uri, NULL, "{}");
> > 61
> > 62    wait_for_migration_status(from, "colo", NULL);
> > 63    wait_for_resume(to, get_dst());
> > 64
> > 65    wait_for_serial("src_serial");
> > 66    wait_for_serial("dest_serial");
> > 
> > Interesting, so the secondary guest is stuck/crahsed after entering
> > colo state despite having resumed.
> > 
> > It works fine here on master. And before the merge I have looped the
> > colo tests for a whole day on my machine without any failures.
> > 
> > How often does this happen? What is the commit you are on, host, ASAN,
> > MSAN, UBSAN, configure options? With kvm or without?  
> 
> I have started the run on a fedora-43 VM. I am on commit:
> de61484ec39f418e5c0d4603017695f9ffb8fe24 master branch.

Okay, so the problem is that nested virtualization slows down kvm page
faults (e.g. dirtying the dirty bitmap) so much that it takes longer
for the guest to dirty the memory than the interval between two colo
checkpoints at which point the dirty bitmap is cleared again. And to
make matters worse the secondary qemu process is starved by the primary
qemu for some reason.

So the following happens:
1. After a checkpoint primary dirties all it's memory pretty quickly
and afterwards the dirty mem + serial out loops quickly. So it is very
likely the guest is in the middle of dirtying the memory at any given
time.
2. The colo checkpointing is actually working flawlessly it only seems
stuck because the migraion-test is stuck and not echoing the qmp events.
The secondary receives the primary checkpoint (whose guest is in the
middle of dirtying memory) and both sides resume. The secondary is
starved and spends a lot of time in the dirty mem phase.
3. Another checkpoint happens, before the secondary guest is finished
dirtying memory and has a chance sending the serial out signal. Goto 1.

Can you try increasing the checkpoint interval the following and see
what works for you:
migrate_set_parameter_int(from, "x-checkpoint-delay", 300);

In my tests, raising it to 1000 worked fine. Though for a doubly nested VM
I had to raise it to 5000. E.g. It takes up to 5 seconds to dirty 100Mb
memory in a doubly nested VM.

Regards,
Lukas Straub

> 
> The configuration command I used was:
> ../configure --target-list=x86_64-softmmu --enable-debug 
> --enable-trace-backends=log --enable-slirp
> 
> Targets and accelerators
>   KVM support                     : YES
> 
> > 
> > Can you try the following and show me the log:
> > (cd build && QTEST_QEMU_IMG=./qemu-img 
> > QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_LOG=- 
> > tests/qtest/migration-test --full -p /x86_64/migration/colo/plain)
> >   
> 
> I have attached the log file : qemu.log to this email. It is stuck at
> the last line : 
> {"timestamp": {"seconds": 1773164370, "microseconds": 609791}, "event": 
> "RESUME"}
> 
> It should be some configuration miss from my end.
> Thank you for looking into it.
> 
> [...]
> > > 
> > > 
> > > Is there something that I am missing? Can you please look into this? 
> > > 
> > > 
> > > Regards,
> > > Arun Menon
> > >   
> >   
> 
> Regards,
> Arun Menon
> 

Attachment: pgp_85d43T97D.pgp
Description: OpenPGP digital signature

Reply via email to