Peter Xu <pet...@redhat.com> writes:

> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <faro...@suse.de> writes:
>>
>> > Peter Xu <pet...@redhat.com> writes:
>> >
>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
>> >>> > I don't think we're discussing total CI time at this point, so the math
>> >>> > doesn't really add up. We're not looking into making the CI finish
>> >>> > faster. We're looking into making migration-test finish faster. That
>> >>> > would reduce timeouts in CI, speed up make check and reduce the chance
>> >>> > of random race conditions* affecting other people/staging runs.
>> >>>
>> >>> Right. The reason migration-test appears on my radar is because
>> >>> it is very frequently the thing that shows up as "this sometimes
>> >>> just fails or just times out and if you hit retry it goes away
>> >>> again". That might not be migration-test's fault specifically,
>> >>> because those retries tend to be certain CI configs (s390,
>> >>> the i686-tci one), and I have some theories about what might be
>> >>> causing it (e.g. the build system runs 4 migration-tests in parallel,
>> >>> which means 8 QEMU processes, which is too many for the number
>> >>> of host CPUs). But right now I look at CI job failures and my reaction
>> >>> is "oh, it's the migration-test failing yet again" :-(
>> >>>
>> >>> For some examples from this week:
>> >>>
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>> >>
>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
>> >> parallel. It indeed sounds like there's no good way to finally solve it..
>> >> I also don't see how speeding up / reducing tests in migration-test
>> >> would help, as that's (to some degree..) the same as tuning the timeout
>> >> value bigger. When there are fewer tests they'll fit into the 480s
>> >> window, but maybe it's then so quick that we wonder whether we should
>> >> shrink it to e.g. 90s, and then it can time out again on a busy host
>> >> with less capacity for concurrency.
>> >>
>> >> But indeed there are two ERRORs ([1,2] above).. I collected some more
>> >> info here before the log expires:
>> >>
>> >> =================================8<================================
>> >>
>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> >>
>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test
>> >> ERROR 144.32s killed by signal 6 SIGABRT
>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>> >>>>> G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>> >>>>> PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3
>> >>>>> QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144
>> >>>>> QTEST_QEMU_BINARY=./qemu-system-i386
>> >>>>> /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test
>> >>>>> --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from
>> >> signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pc-i440fx-9.2
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest
>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev
>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon
>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel
>> >> tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial
>> >> file:/tmp/migration-test-4112T2/src_serial -drive
>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device
>> >> ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest
>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev
>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon
>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel
>> >> tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial
>> >> file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive
>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device
>> >> ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from
>> >> signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> >>
>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test
>> >> ERROR 381.00s killed by signal 6 SIGABRT
>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3
>> >>>>> QTEST_QEMU_IMG=./qemu-img
>> >>>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>> >>>>> QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178
>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>> >>>>> /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was
>> >> non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from
>> >> signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pseries-9.2
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest
>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev
>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control
>> >> -display none -audio none -accel kvm -accel tcg -machine
>> >> pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial
>> >> file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine
>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios
>> >> /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest
>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev
>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control
>> >> -display none -audio none -accel kvm -accel tcg -machine
>> >> pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial
>> >> file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults
>> >> -machine
>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios
>> >> /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was
>> >> non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from
>> >> signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> (test program exited with status code -6)
>> >> =================================8<================================
>> >>
>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>> >> the different host / arch combinations being tested. What is weirder is
>> >> that the two failures are different: the 2nd failure throws out a TLS
>> >> error even though the test doesn't have TLS involved yet.
>> >
>> > I think that's just a parallel test being cancelled prematurely, either
>> > due to the crash or due to the timeout.
>> >
>> >> Fabiano, is this the issue you're looking at?
>> >
>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and aarch64.
>>
>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>> zero page causing multiple page faults"), the multifd code started using
>> the rb->receivedmap bitmap, which belongs to the ram code and is
>> initialized and *freed* from the ram SaveVMHandlers:
>>
>> process_incoming_migration_co()            ...
>>   qemu_loadvm_state()                      multifd_nocomp_recv()
>>     qemu_loadvm_state_cleanup()              ramblock_recv_bitmap_set_offset()
>>       rb->receivedmap = NULL                   set_bit_atomic(...,
>>                                                               rb->receivedmap)
>>   ...
>>   migration_incoming_state_destroy()
>>     multifd_recv_cleanup()
>>       multifd_recv_terminate_threads(NULL)
>>
>> Multifd threads are live until migration_incoming_state_destroy(), which
>> is called some time later.
>
> Thanks for the debugging. Hmm, I would expect loadvm to somehow wait until
> all ram is received..
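[The use-after-free above reduces to a generic lifetime rule: a shared bitmap must outlive every thread that can still write to it, i.e. teardown has to terminate/join the recv threads before freeing the map. A minimal sketch of the safe ordering in plain C with pthreads — names like recv_worker and run_demo are illustrative stand-ins, not QEMU code:]

```c
#include <pthread.h>
#include <stdlib.h>

/* Stand-in for rb->receivedmap: one bit per "received" page. */
enum { NBITS = 1024 };
static unsigned long *receivedmap;

/* Stand-in for a multifd recv thread setting bits, like set_bit_atomic()
 * does on the real receivedmap. */
static void *recv_worker(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < NBITS; i++) {
        __atomic_fetch_or(&receivedmap[i / 64], 1UL << (i % 64),
                          __ATOMIC_SEQ_CST);
    }
    return NULL;
}

/* Safe teardown ordering: join the writer *before* freeing the bitmap.
 * The reported bug is the inverse interleaving: qemu_loadvm_state_cleanup()
 * freed/NULLed the map while the multifd threads were still live. */
int run_demo(void)
{
    pthread_t t;
    int set = 0;

    receivedmap = calloc(NBITS / 64, sizeof(unsigned long));
    pthread_create(&t, NULL, recv_worker, NULL);
    pthread_join(t, NULL);      /* no writer can touch the map past here */

    for (size_t i = 0; i < NBITS; i++) {
        set += !!(receivedmap[i / 64] & (1UL << (i % 64)));
    }
    free(receivedmap);
    receivedmap = NULL;
    return set;
}
```

[In QEMU terms, the join corresponds to multifd_recv_terminate_threads() finishing before anything frees rb->receivedmap.]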
This looks like a similar issue to when we didn't have the multifd_send
sync working correctly and the ram code would run and do cleanup.

> And when looking, I also found we're ambiguous on the cleanups now..
> probably a separate issue that doesn't yet cause crashes. I mean we don't
> even have a consistent order for cleaning these up in precopy / postcopy:
>
> We have this:
>
> qemu_loadvm_state
>   qemu_loadvm_state_cleanup                <------------- [1]
> migration_bh_schedule(process_incoming_migration_bh, mis);
> (then in bh...)
> process_incoming_migration_bh
>   migration_incoming_state_destroy         <------------- [2]
>
> But then:
>
> postcopy_ram_listen_thread
>   migration_incoming_state_destroy         <------------- [2]
>   qemu_loadvm_state_cleanup                <------------- [1]
>
> I think it makes more sense to do [2] after [1], so maybe postcopy needs
> changing too..

Yes, ram_load_cleanup doesn't do much:

static int ram_load_cleanup(void *opaque)
{
    RAMBlock *rb;

    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
        qemu_ram_block_writeback(rb);
    }

    xbzrle_load_cleanup();
    return 0;
}

We can probably change the order just fine.

For the bug, I'm now trying to add a helper to init/clean the receivedmap
and only call it from the code that uses it (postcopy, multifd_recv),
instead of always initializing/cleaning it along with
ram_load_setup/cleanup.

>
>> >> Peter, do you think it'll be helpful if we temporarily mark this test as
>> >> "slow" too so it's not run in CI (so we still run it ourselves when we
>> >> prepare the migration PR, with the hope that it can reproduce)?
>> >>
>> >> Thanks,
>>
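[The helper approach described above could look roughly like the sketch below — generic C with hypothetical names (ramblock_recv_map_init/cleanup, recv_map_set/test); the actual QEMU patch may differ. The point is ownership: the code that uses the bitmap (multifd recv / postcopy) initializes and frees it, so generic ram cleanup can no longer pull it out from under live threads:]

```c
#include <stdlib.h>
#include <stddef.h>

/* Reduced stand-in for QEMU's RAMBlock; only the receivedmap matters here. */
typedef struct RAMBlock {
    size_t nbits;
    unsigned long *receivedmap;
} RAMBlock;

/* Hypothetical helpers: called from the receivedmap's users (multifd recv
 * setup/teardown, postcopy), not from ram_load_setup()/ram_load_cleanup(). */
void ramblock_recv_map_init(RAMBlock *rb, size_t nbits)
{
    rb->nbits = nbits;
    rb->receivedmap = calloc((nbits + 63) / 64, sizeof(unsigned long));
}

void ramblock_recv_map_cleanup(RAMBlock *rb)
{
    free(rb->receivedmap);
    rb->receivedmap = NULL;   /* a stray late lookup sees NULL, not freed memory */
}

/* Mirrors ramblock_recv_bitmap_set_offset(): mark one page as received.
 * Atomic because multiple recv threads touch the map concurrently. */
void recv_map_set(RAMBlock *rb, size_t bit)
{
    __atomic_fetch_or(&rb->receivedmap[bit / 64], 1UL << (bit % 64),
                      __ATOMIC_SEQ_CST);
}

int recv_map_test(const RAMBlock *rb, size_t bit)
{
    return !!(rb->receivedmap[bit / 64] & (1UL << (bit % 64)));
}
```

[With this split, cleanup naturally happens after the owning subsystem has stopped its threads, instead of at qemu_loadvm_state_cleanup() time.]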