On Fri, Sep 15, 2023 at 12:28:36PM -0400, Peter Xu wrote:
> On Thu, Sep 14, 2023 at 10:56:23PM -0300, Fabiano Rosas wrote:
> > Peter Xu <pet...@redhat.com> writes:
> >
> > > On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
> > >> Fabiano Rosas <faro...@suse.de> writes:
> > >>
> > >> > Peter Xu <pet...@redhat.com> writes:
> > >> >
> > >> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
> > >> >>> I managed to reproduce it. It's not the return path error. In
> > >> >>> hindsight that's obvious, because that error happens in the
> > >> >>> 'recovery' test and this one in the 'plain' one. Sorry about
> > >> >>> the noise.
> > >> >>
> > >> >> No worry. It's good to finally identify that.
> > >> >>
> > >> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
> > >> >>> investigate.
> > >> >
> > >> > It seems that we're getting a tcp disconnect (ECONNRESET) when doing
> > >> > that shutdown() on postcopy_qemufile_src. The one from commit
> > >> > 6621883f93 ("migration: Fix potential race on postcopy_qemufile_src").
> > >> >
> > >> > I'm trying to determine why that happens, when other times it just
> > >> > returns 0 as expected.
> > >> >
> > >> > Could this mean that we're kicking the dest too soon while it is still
> > >> > receiving valid data?
> > >>
> > >> Looking a bit more into this, what's happening is that
> > >> postcopy_ram_incoming_cleanup() is shutting down postcopy_qemufile_dst
> > >> while ram_load_postcopy() is still running.
> > >>
> > >> The postcopy_ram_listen_thread() function waits for the
> > >> main_thread_load_event, but that only works when not using preempt. With
> > >> the preempt thread, the event is set right away and we proceed to do the
> > >> cleanup without waiting.
> > >>
> > >> So the assumption of commit 6621883f93 that the incoming side knows when
> > >> it has finished migrating is wrong IMO. Without the EOS we're relying on
> > >> the chance that the shutdown() happens after the last recvmsg has
> > >> returned and not during it.
> > >>
> > >> Peter, what do you think?
> > >
> > > That's a good point.
> > >
> > > One way to verify that (sorry, I still cannot reproduce it myself, which
> > > is so weird... it seems adding load won't really help reproducing this)
> > > is to let the main thread wait for all requested pages to arrive:
> > >
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > index 29aea9456d..df055c51ea 100644
> > > --- a/migration/postcopy-ram.c
> > > +++ b/migration/postcopy-ram.c
> > > @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> > >      trace_postcopy_ram_incoming_cleanup_entry();
> > >
> > >      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
> > > +        /*
> > > +         * NOTE! It's possible that the preempt thread is still handling
> > > +         * the last pages to arrive which were requested by faults. Make
> > > +         * sure nothing is left behind.
> > > +         */
> > > +        while (qatomic_read(&mis->page_requested_count));
> > >          /* Notify the fast load thread to quit */
> > >          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
> > >          if (mis->postcopy_qemufile_dst) {
> > >
> > > If that works solidly, we can figure out a better way than a dead loop
> > > here.
> >
> > Yep, 2000+ iterations so far and no error.
>
> Definitely good news.. thanks.
>
> > Should we add something that makes ram_load_postcopy return once it's
> > finished?
> > Then this code could just set PREEMPT_THREAD_QUIT and join the
> > preempt thread.
>
> Yeah it'll be nice to have that. We used to have an EOS message at the end
> for the preempt channel, but that used to cause another race (which IIRC I
> could easily reproduce at that time) and we removed it after
> 6621883f9398bc3 for 7.2+.
>
> Now, instead of fiddling with the protocol again, we do have a way to
> detect that, but it may need some tricks around counting
> page_requested_count plus a signal mechanism.
>
> I've drafted one solution and attached it here as a complete patch; feel
> free to try it. If you have anything better, that'll be even nicer; just
> let me know your thoughts.
>
> ---8<---
> From 384aff2264650872e15916dcfeaec593e5e9b781 Mon Sep 17 00:00:00 2001
> From: Peter Xu <pet...@redhat.com>
> Date: Fri, 15 Sep 2023 12:11:34 -0400
> Subject: [PATCH] migration: Fix race that dest preempt thread close too early
>
> We hit an intermittent CI failure in migration-test on the preempt/plain
> unit test:
>
> qemu-system-x86_64: Unable to read from socket: Connection reset by peer
> Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
> **
> ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
> (test program exited with status code -6)
>
> Fabiano debugged it and found that the preempt thread can quit even
> without receiving all the pages, which can cause the guest to not receive
> all the pages and corrupt guest memory.
>
> To make sure the preempt thread has finished receiving all the pages, we
> can rely on page_requested_count reaching zero, because the preempt
> channel will only receive requested page faults. Note that not all faulted
> pages need to be sent via the preempt channel/thread; imagine the case
> when a requested page has already been queued into the background main
> channel for migration: the src qemu will then still send it via the
> background channel.
>
> Here, instead of spinning on reading the count, we add a condvar so the
> main thread can wait on it if that unusual case happens, without burning
> cpu for no good reason, even if the duration is short; spinning in this
> rare case would probably be fine, but it's better not to.
>
> The condvar is only used when that special case is triggered. Some memory
> ordering trickery is needed (ordering the count update against the preempt
> thread status field) to guarantee the wakeup is never missed, so the main
> thread will always get a kick when that case triggers.
>
> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
> Debugged-by: Fabiano Rosas <faro...@suse.de>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> ---
>  migration/migration.h    | 13 ++++++++++++-
>  migration/migration.c    |  1 +
>  migration/postcopy-ram.c | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index c390500604..cdaa10d515 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -196,7 +196,10 @@ struct MigrationIncomingState {
>
>      /* A tree of pages that we requested to the source VM */
>      GTree *page_requested;
> -    /* For debugging purpose only, but would be nice to keep */
> +    /*
> +     * For postcopy only, count the number of requested page faults that
> +     * still haven't been resolved.
> +     */
>      int page_requested_count;
>      /*
>       * The mutex helps to maintain the requested pages that we sent to the
> @@ -210,6 +213,14 @@ struct MigrationIncomingState {
>       * contains valid information.
>       */
>      QemuMutex page_request_mutex;
> +    /*
> +     * If postcopy preempt is enabled, there is a chance that the main
> +     * thread finished loading its data before the preempt channel has
> +     * finished loading the urgent pages. If that happens, the two threads
> +     * will use this condvar to synchronize, so the main thread will always
> +     * wait until all pages are received.
> +     */
> +    QemuCond page_request_cond;
>
>      /*
>       * Number of devices that have yet to approve switchover. When this reaches
> diff --git a/migration/migration.c b/migration/migration.c
> index d61e572742..e86b3f7368 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -153,6 +153,7 @@ void migration_object_init(void)
>      qemu_sem_init(&current_incoming->postcopy_qemufile_dst_done, 0);
>
>      qemu_mutex_init(&current_incoming->page_request_mutex);
> +    qemu_cond_init(&current_incoming->page_request_cond);
>      current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>
>      migration_object_check(current_migration, &error_fatal);
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 29aea9456d..dbf02cd3ed 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -599,6 +599,30 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>          /* Notify the fast load thread to quit */
>          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
> +        /*
> +         * Update preempt_thread_status before reading the count. Note:
> +         * the mutex lock only provides ACQUIRE semantics, so it doesn't
> +         * stop this write from being reordered after reading the count.
> +         */
> +        smp_mb();
> +        /*
> +         * It's possible that the preempt thread is still handling the last
> +         * pages to arrive which were requested by guest page faults.
> +         * Make sure nothing is left behind by waiting on the condvar if
> +         * that unlikely case happens.
> +         */
> +        WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
> +            if (qatomic_read(&mis->page_requested_count)) {
> +                /*
> +                 * It is guaranteed to receive a signal later, because
> +                 * count > 0 now, so it's destined to be decreased to zero
> +                 * very soon by the preempt thread.
> +                 */
> +                qemu_cond_wait(&mis->page_request_cond,
> +                               &mis->page_request_mutex);
> +            }
> +        }
> +        /* Notify the fast load thread to quit */
>          if (mis->postcopy_qemufile_dst) {
>              qemu_file_shutdown(mis->postcopy_qemufile_dst);
>          }
> @@ -1279,6 +1303,17 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>              g_tree_remove(mis->page_requested, host_addr);
>              mis->page_requested_count--;
>              trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
> +            /* Order the update of count and read of preempt status */
> +            smp_mb();
> +            if (qatomic_read(&mis->preempt_thread_status) ==
> +                PREEMPT_THREAD_QUIT) {
> +                /*
> +                 * This probably means the main thread is waiting for us.
> +                 * Notify that we've finished receiving the last requested
> +                 * page.
> +                 */
> +                qemu_cond_signal(&mis->page_request_cond);
> +            }
>          }
>          qemu_mutex_unlock(&mis->page_request_mutex);
>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
> --
> 2.41.0
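[For anyone who wants to see the handshake from the quoted patch in isolation: below is a minimal, self-contained sketch of the same idea, using plain pthreads and C11 atomics as stand-ins for QEMU's qemu_mutex/qemu_cond/qatomic/smp_mb() primitives. All names here are made up for illustration; only the store-status / barrier / read-count ordering mirrors the patch.]

/* handshake.c - build with: gcc -O2 -pthread handshake.c -o handshake */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
/* stand-in for mis->page_requested_count */
static atomic_int requested_pages = 5;
/* stand-in for mis->preempt_thread_status == PREEMPT_THREAD_QUIT */
static atomic_bool quit_requested = false;

/* Preempt-thread side: resolve requested pages one by one. */
static void *preempt_thread(void *arg)
{
    (void)arg;
    while (atomic_load(&requested_pages) > 0) {
        usleep(1000); /* pretend a page just arrived over the channel */
        pthread_mutex_lock(&lock);
        atomic_fetch_sub(&requested_pages, 1);
        /*
         * seq_cst atomics already order the decrement against the flag
         * read below; this is the job smp_mb() does in the patch, since
         * the mutex alone would not order the QUIT store vs. count read.
         */
        if (atomic_load(&quit_requested)) {
            pthread_cond_signal(&cond); /* main thread may be waiting */
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Main-thread side: announce quit, then wait out the in-flight pages. */
static void cleanup_wait(void)
{
    atomic_store(&quit_requested, true); /* like PREEMPT_THREAD_QUIT */
    pthread_mutex_lock(&lock);
    while (atomic_load(&requested_pages) > 0) {
        pthread_cond_wait(&cond, &lock); /* woken as pages drain */
    }
    pthread_mutex_unlock(&lock);
    printf("all requested pages resolved; safe to shut down channel\n");
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, preempt_thread, NULL);
    cleanup_wait();
    pthread_join(t, NULL);
    return 0;
}

[Note the sketch wraps the wait in a while loop, so it stays correct even though the signaler kicks on every page rather than just the last one; the quoted patch waits with a single if, which is exactly what the follow-up below tightens.]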
I just made some changes on top of this patch.. One major thing is to
double check that we only kick the main thread if it's the last page, as I
think there can be more than one page in flight.

While at it, I also cleaned up the accesses to the preempt status field
and the counter, to keep them consistent (never use atomic ops for the
status, as it's defined as volatile anyway; meanwhile always use qatomic
for the counter).

---8<---
From bfdd1f872fbeafe26ebad280a6d9e5a88de621de Mon Sep 17 00:00:00 2001
From: Peter Xu <pet...@redhat.com>
Date: Fri, 15 Sep 2023 12:45:01 -0400
Subject: [PATCH] fixup! migration: Fix race that dest preempt thread close too early

Three more small changes:

- Always use qatomic_* for page_requested_count accesses
- Only kick the main thread when we're sure it's the last page
- Drop qatomic_read() when reading the preempt status, to match how it is
  accessed elsewhere; not needed since it's defined as volatile already

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 migration/migration.c    | 2 +-
 migration/postcopy-ram.c | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e86b3f7368..3ee1e6b0d6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -368,7 +368,7 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
              * things like g_tree_lookup() will return TRUE (1) when found.
              */
             g_tree_insert(mis->page_requested, aligned, (gpointer)1);
-            mis->page_requested_count++;
+            qatomic_inc(&mis->page_requested_count);
             trace_postcopy_page_req_add(aligned, mis->page_requested_count);
         }
     }
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index dbf02cd3ed..5408e028c6 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1301,12 +1301,13 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
          */
         if (g_tree_lookup(mis->page_requested, host_addr)) {
             g_tree_remove(mis->page_requested, host_addr);
-            mis->page_requested_count--;
+            int left_pages = qatomic_dec_fetch(&mis->page_requested_count);
+
             trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
             /* Order the update of count and read of preempt status */
             smp_mb();
-            if (qatomic_read(&mis->preempt_thread_status) ==
-                PREEMPT_THREAD_QUIT) {
+            if (mis->preempt_thread_status == PREEMPT_THREAD_QUIT &&
+                left_pages == 0) {
                 /*
                  * This probably means the main thread is waiting for us.
                  * Notify that we've finished receiving the last requested
--
2.41.0

--
Peter Xu
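[To make the "last page" condition above concrete: the fixup works because qatomic_dec_fetch() returns the post-decrement value, so the preempt thread knows whether it was the one that drained the count to zero, and only then kicks the main thread. That matters because the main thread waits with a single if, not a loop, so any earlier signal would wake it while pages are still in flight. A compact sketch with C11 atomics and hypothetical names, not QEMU code:]

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* stand-in for mis->page_requested_count */
static atomic_int requested_pages = 3;

/* Returns true iff the caller just resolved the last in-flight page. */
static bool resolved_last_page(void)
{
    /*
     * atomic_fetch_sub returns the OLD value, so subtract one more;
     * QEMU's qatomic_dec_fetch() returns the new value directly.
     */
    int left_pages = atomic_fetch_sub(&requested_pages, 1) - 1;
    return left_pages == 0;
}

int main(void)
{
    assert(!resolved_last_page()); /* 3 -> 2: do not signal yet */
    assert(!resolved_last_page()); /* 2 -> 1: do not signal yet */
    assert(resolved_last_page());  /* 1 -> 0: kick the waiting main thread */
    return 0;
}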