Peter Xu <pet...@redhat.com> writes:

> On Thu, Oct 05, 2023 at 06:10:20PM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>>
>> > On Thu, Oct 05, 2023 at 10:37:56AM -0300, Fabiano Rosas wrote:
>> >> >> +    /*
>> >> >> +     * Make sure both QEMU instances will go into RECOVER stage, then test
>> >> >> +     * kicking them out using migrate-pause.
>> >> >> +     */
>> >> >> +    wait_for_postcopy_status(from, "postcopy-recover");
>> >> >> +    wait_for_postcopy_status(to, "postcopy-recover");
>> >> >
>> >> > Is this wait out of place? I think we're trying to resume too fast after
>> >> > migrate_recover():
>> >> >
>> >> > # {
>> >> > #     "error": {
>> >> > #         "class": "GenericError",
>> >> > #         "desc": "Cannot resume if there is no paused migration"
>> >> > #     }
>> >> > # }
>> >> >
>> >>
>> >> Ugh, sorry about the long lines:
>> >>
>> >> {
>> >>     "error": {
>> >>         "class": "GenericError",
>> >>         "desc": "Cannot resume if there is no paused migration"
>> >>     }
>> >> }
>> >
>> > Sorry I didn't get you here.  Could you elaborate your question?
>> >
>>
>> The test is sometimes failing with the above message.
>>
>> But indeed my question doesn't make sense. I forgot migrate_recover
>> happens on the destination. Nevermind.
>>
>> The bug is still present nonetheless. We're going into migrate_prepare
>> in some state other than POSTCOPY_PAUSED.
>
> Oh I see.  Interestingly I cannot reproduce on my host, just like last
> time..
>
> What is your setup for running the test?  Anything special?  Here's my
> cmdline:

The crudest oneliner:

for i in $(seq 1 9999); do echo "$i ============="; \
QTEST_QEMU_BINARY=./qemu-system-x86_64 \
./tests/qtest/migration-test -r /x86_64/migration/postcopy/recovery || break ; done

I suspect my system has something specific to it that affects the timing
of the tests. But I have no idea what it could be.

$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Address sizes:        39 bits physical, 48 bits virtual
Byte Order:           Little Endian
CPU(s):               16
On-line CPU(s) list:  0-15
Vendor ID:            GenuineIntel
Model name:           11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
CPU family:           6
Model:                141
Thread(s) per core:   2
Core(s) per socket:   8
Socket(s):            1
Stepping:             1
CPU max MHz:          4800.0000
CPU min MHz:          800.0000
BogoMIPS:             4992.00

>
> $ cat reproduce.sh
> index=$1
> loop=0
>
> while :; do
>     echo "Starting loop=$loop..."
>     QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/postcopy/recovery/double-failures
>     if [[ $? != 0 ]]; then
>         echo "index $index REPRODUCED (loop=$loop) !"
>         break
>     fi
>     loop=$(( loop + 1 ))
> done
>
> Survives 200+ loops and kept going.
>
> However I think I saw what's wrong here, could you help try below fixup?
>

Sure. I won't get to it until tomorrow though.
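
For anyone else trying to reproduce: both loops in this thread do the same
thing, namely rerun the test until the first failure and report the
iteration. A minimal sketch of that pattern, assuming a POSIX shell;
run_until_fail is a hypothetical helper name, not something from the QEMU
tree:

```shell
#!/bin/sh
# Hypothetical helper (not part of QEMU): run a command up to MAX times
# and stop at the first non-zero exit status.
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "$i ============="
        if ! "$@"; then
            # The command failed, so the intermittent bug was hit.
            echo "REPRODUCED at iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "survived $max iterations"
    return 0
}
```

Usage would look something like this, with the binary and test path taken
from the oneliner earlier in this message:

export QTEST_QEMU_BINARY=./qemu-system-x86_64
run_until_fail 9999 ./tests/qtest/migration-test -r /x86_64/migration/postcopy/recovery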