On Tue, 10 Mar 2026 23:12:25 +0530 Arun Menon <[email protected]> wrote:
> Hi Lukas, > > On Tue, Mar 10, 2026 at 04:29:54PM +0100, Lukas Straub wrote: > > On Tue, 10 Mar 2026 20:17:57 +0530 > > Arun Menon <[email protected]> wrote: > > > > > Hi Lukas, > > > > > > On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote: > > > > Add a COLO migration test for COLO migration and failover. > > > > > > > > Reviewed-by: Fabiano Rosas <[email protected]> > > > > Tested-by: Fabiano Rosas <[email protected]> > > > > Reviewed-by: Peter Xu <[email protected]> > > > > Signed-off-by: Lukas Straub <[email protected]> > > > > [...] > > > > > > > > > > I was running the qtests locally, and I encountered a timeout error. > > > > > > Command run: mkdir -p build ; cd build ; make check-qtest-x86_64; > > > > > > Following is the output: > > > ====== > > > [...] > > > > > > Summary of Failures: > > > 67/67 qtest+qtest-x86_64 - qemu:qtest-x86_64/migration-test > > > TIMEOUT 480.05s killed by signal 15 SIGTERM > > > Ok: 64 > > > Fail: 0 > > > Skipped: 2 > > > Timeout: 1 > > > ====== > > > > > > It seems that the test runner is stuck waiting for some input. > > > Following is the stack trace > > > [...] > > > ====== > > > gstack 128276 > > > Thread 2 (Thread 0x7fdd090716c0 (LWP 128279) "call_rcu"): > > > #0 0x00007fdd0921434d in syscall () from /lib64/libc.so.6 > > > #1 0x0000557fd604563a in qemu_futex_wait (f=0x557fd60a0190 > > > <rcu_call_ready_event>, val=4294967295) at > > > /home/arun/workdir/new/devel/upstream/qemu-priv/include/qemu/futex.h:47 > > > #2 0x0000557fd604584e in qemu_event_wait (ev=0x557fd60a0190 > > > <rcu_call_ready_event>) at ../util/event.c:162 > > > #3 0x0000557fd6045fde in call_rcu_thread (opaque=0x0) at > > > ../util/rcu.c:304 > > > #4 0x0000557fd600e8fb in qemu_thread_start (args=0x557fd6beec70) at > > > ../util/qemu-thread-posix.c:414 > > > #5 0x00007fdd09193464 in start_thread () from /lib64/libc.so.6 > > > #6 0x00007fdd092165ac in __clone3 () from /lib64/libc.so.6 > > > > > > Thread 1 (Thread 0x7fdd09073240 (LWP 128276) "migration-test"): > > > #0 0x00007fdd0919b982 in __syscall_cancel_arch () from /lib64/libc.so.6 > > > #1 0x00007fdd0918fc3c in __internal_syscall_cancel () from > > > /lib64/libc.so.6 > > > #2 0x00007fdd091dfb62 in clock_nanosleep@GLIBC_2.2.5 () from > > > /lib64/libc.so.6 > > > #3 0x00007fdd091ebb37 in nanosleep () from /lib64/libc.so.6 > > > #4 0x00007fdd0921613a in usleep () from /lib64/libc.so.6 > > > #5 0x0000557fd5fd99cd in wait_for_serial (side=0x557fd6065f08 > > > "dest_serial") at ../tests/qtest/migration/framework.c:82 > > > #6 0x0000557fd5fe5865 in test_colo_common (args=0x557fd6bfdf50, > > > failover_during_checkpoint=false, primary_failover=true) at > > > ../tests/qtest/migration/colo-tests.c:66 > > > > 60 migrate_qmp(from, to, args->connect_uri, NULL, "{}"); > > 61 > > 62 wait_for_migration_status(from, "colo", NULL); > > 63 wait_for_resume(to, get_dst()); > > 64 > > 65 wait_for_serial("src_serial"); > > 66 wait_for_serial("dest_serial"); > > > > Interesting, so the secondary guest is stuck/crahsed after entering > > colo state despite having resumed. > > > > It works fine here on master. And before the merge I have looped the > > colo tests for a whole day on my machine without any failures. > > > > How often does this happen? What is the commit you are on, host, ASAN, > > MSAN, UBSAN, configure options? With kvm or without? > > I have started the run on a fedora-43 VM. I am on commit: > de61484ec39f418e5c0d4603017695f9ffb8fe24 master branch. Okay, so the problem is that nested virtualization slows down kvm page faults (e.g. dirtying the dirty bitmap) so much that it takes longer for the guest to dirty the memory than the interval between two colo checkpoints at which point the dirty bitmap is cleared again. And to make matters worse the secondary qemu process is starved by the primary qemu for some reason. So the following happens: 1. After a checkpoint primary dirties all it's memory pretty quickly and afterwards the dirty mem + serial out loops quickly. So it is very likely the guest is in the middle of dirtying the memory at any given time. 2. The colo checkpointing is actually working flawlessly it only seems stuck because the migraion-test is stuck and not echoing the qmp events. The secondary receives the primary checkpoint (whose guest is in the middle of dirtying memory) and both sides resume. The secondary is starved and spends a lot of time in the dirty mem phase. 3. Another checkpoint happens, before the secondary guest is finished dirtying memory and has a chance sending the serial out signal. Goto 1. Can you try increasing the checkpoint interval the following and see what works for you: migrate_set_parameter_int(from, "x-checkpoint-delay", 300); In my tests, raising it to 1000 worked fine. Though for a doubly nested VM I had to raise it to 5000. E.g. It takes up to 5 seconds to dirty 100Mb memory in a doubly nested VM. Regards, Lukas Straub > > The configuration command I used was: > ../configure --target-list=x86_64-softmmu --enable-debug > --enable-trace-backends=log --enable-slirp > > Targets and accelerators > KVM support : YES > > > > > Can you try the following and show me the log: > > (cd build && QTEST_QEMU_IMG=./qemu-img > > QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_LOG=- > > tests/qtest/migration-test --full -p /x86_64/migration/colo/plain) > > > > I have attached the log file : qemu.log to this email. It is stuck at > the last line : > {"timestamp": {"seconds": 1773164370, "microseconds": 609791}, "event": > "RESUME"} > > It should be some configuration miss from my end. > Thank you for looking into it. > > [...] > > > > > > > > > Is there something that I am missing? Can you please look into this? > > > > > > > > > Regards, > > > Arun Menon > > > > > > > Regards, > Arun Menon >
pgp_85d43T97D.pgp
Description: OpenPGP digital signature
