On Thu, Oct 05, 2023 at 06:10:20PM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
>
> > On Thu, Oct 05, 2023 at 10:37:56AM -0300, Fabiano Rosas wrote:
> >> >> +    /*
> >> >> +     * Make sure both QEMU instances will go into RECOVER stage, then test
> >> >> +     * kicking them out using migrate-pause.
> >> >> +     */
> >> >> +    wait_for_postcopy_status(from, "postcopy-recover");
> >> >> +    wait_for_postcopy_status(to, "postcopy-recover");
> >> >
> >> > Is this wait out of place? I think we're trying to resume too fast after
> >> > migrate_recover():
> >> >
> >> > # {
> >> > #     "error": {
> >> > #         "class": "GenericError",
> >> > #         "desc": "Cannot resume if there is no paused migration"
> >> > #     }
> >> > # }
> >> >
> >>
> >> Ugh, sorry about the long lines:
> >>
> >> {
> >>     "error": {
> >>         "class": "GenericError",
> >>         "desc": "Cannot resume if there is no paused migration"
> >>     }
> >> }
> >
> > Sorry I didn't get you here.  Could you elaborate your question?
> >
>
> The test is sometimes failing with the above message.
>
> But indeed my question doesn't make sense. I forgot migrate_recover
> happens on the destination. Nevermind.
>
> The bug is still present nonetheless. We're going into migrate_prepare
> in some state other than POSTCOPY_PAUSED.

Oh I see.  Interestingly I cannot reproduce on my host, just like last time..

What is your setup for running the test?  Anything special?  Here's my
cmdline:

$ cat reproduce.sh
index=$1
loop=0

while :; do
    echo "Starting loop=$loop..."
    QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/postcopy/recovery/double-failures
    if [[ $? != 0 ]]; then
        echo "index $index REPRODUCED (loop=$loop) !"
        break
    fi
    loop=$(( loop + 1 ))
done

Survives 200+ loops and kept going.

However I think I saw what's wrong here, could you help try below fixup?

Thanks,

===8<===
From 52bd2cd5ddf472e0bb99789dba3660a626382630 Mon Sep 17 00:00:00 2001
From: Peter Xu <pet...@redhat.com>
Date: Thu, 5 Oct 2023 17:38:42 -0400
Subject: [PATCH] fixup! tests/migration-test: Add a test for postcopy hangs during RECOVER

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 tests/qtest/migration-test.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index fb7a3765e4..1bdae0a579 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1489,9 +1489,8 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
      * migrate-recover command can only succeed if destination machine
      * is in the paused state
      */
-    wait_for_migration_status(to, "postcopy-paused",
-                              (const char * []) { "failed", "active",
-                                                  "completed", NULL });
+    wait_for_postcopy_status(to, "postcopy-paused");
+    wait_for_postcopy_status(from, "postcopy-paused");
 
     if (args->postcopy_recovery_test_fail) {
         /*
@@ -1514,9 +1513,6 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
      * Try to rebuild the migration channel using the resume flag and
      * the newly created channel
      */
-    wait_for_migration_status(from, "postcopy-paused",
-                              (const char * []) { "failed", "active",
-                                                  "completed", NULL });
     migrate_qmp(from, uri, "{'resume': true}");
 
     /* Restore the postcopy bandwidth to unlimited */
-- 
2.41.0

-- 
Peter Xu