Peter Xu <pet...@redhat.com> writes:

> On Fri, Sep 15, 2023 at 12:28:36PM -0400, Peter Xu wrote:
>> On Thu, Sep 14, 2023 at 10:56:23PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <pet...@redhat.com> writes:
>> >
>> > > On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
>> > >> Fabiano Rosas <faro...@suse.de> writes:
>> > >>
>> > >> > Peter Xu <pet...@redhat.com> writes:
>> > >> >
>> > >> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
>> > >> >>> I managed to reproduce it. It's not the return path error. In
>> > >> >>> hindsight that's obvious because that error happens in the
>> > >> >>> 'recovery' test and this one in the 'plain' one. Sorry about the
>> > >> >>> noise.
>> > >> >>
>> > >> >> No worry. It's good to finally identify that.
>> > >> >>
>> > >> >>>
>> > >> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
>> > >> >>> investigate.
>> > >> >
>> > >> > It seems that we're getting a tcp disconnect (ECONNRESET) when doing
>> > >> > that shutdown() on postcopy_qemufile_src. The one from commit
>> > >> > 6621883f93 ("migration: Fix potential race on postcopy_qemufile_src").
>> > >> >
>> > >> > I'm trying to determine why that happens when other times it just
>> > >> > returns 0 as expected.
>> > >> >
>> > >> > Could this mean that we're kicking the dest too soon while it is still
>> > >> > receiving valid data?
>> > >>
>> > >> Looking a bit more into this, what's happening is that
>> > >> postcopy_ram_incoming_cleanup() is shutting down the
>> > >> postcopy_qemufile_dst while ram_load_postcopy() is still running.
>> > >>
>> > >> The postcopy_ram_listen_thread() function waits for the
>> > >> main_thread_load_event, but that only works when not using preempt. With
>> > >> the preempt thread, the event is set right away and we proceed to do the
>> > >> cleanup without waiting.
>> > >>
>> > >> So the assumption of commit 6621883f93 that the incoming side knows when
>> > >> it has finished migrating is wrong IMO. Without the EOS we're relying on
>> > >> the chance that the shutdown() happens after the last recvmsg has
>> > >> returned and not during it.
>> > >>
>> > >> Peter, what do you think?
>> > >
>> > > That's a good point.
>> > >
>> > > One thing to verify that (sorry, I still cannot reproduce it myself,
>> > > which is so weird... it seems loads won't really help reproducing this)
>> > > is to let the main thread wait for all requested pages to arrive:
>> > >
>> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> > > index 29aea9456d..df055c51ea 100644
>> > > --- a/migration/postcopy-ram.c
>> > > +++ b/migration/postcopy-ram.c
>> > > @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>> > >      trace_postcopy_ram_incoming_cleanup_entry();
>> > >
>> > >      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>> > > +        /*
>> > > +         * NOTE! it's possible that the preempt thread is still handling
>> > > +         * the last pages to arrive which were requested by faults.  Making
>> > > +         * sure nothing is left behind.
>> > > +         */
>> > > +        while (qatomic_read(&mis->page_requested_count));
>> > > +
>> > >          /* Notify the fast load thread to quit */
>> > >          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> > >          if (mis->postcopy_qemufile_dst) {
>> > >
>> > > If that can work solidly, we can figure out a better way than a dead loop
>> > > here.
>> >
>> > Yep, 2000+ iterations so far and no error.
>>
>> Definitely good news.. thanks.
>>
>> >
>> > Should we add something that makes ram_load_postcopy return once it's
>> > finished? Then this code could just set PREEMPT_THREAD_QUIT and join the
>> > preempt thread.
>>
>> Yeah it'll be nice to have that.
>> We used to have an EOS message at the end for the preempt channel, but
>> that used to cause another race (which IIRC I could easily reproduce at
>> that time) and we removed it after 6621883f9398bc3 for 7.2+.
>>
>> Now instead of fiddling with the protocol again, we do have a way to detect
>> that, but it may need some trick on counting on page_requested_count and
>> also provide a signal mechanism.
>>
>> I've drafted one solution and attached it here as a complete patch, feel
>> free to try. If you have anything better, that'll be even nicer; just let
>> me know your thoughts.
>>
>> ---8<---
>> From 384aff2264650872e15916dcfeaec593e5e9b781 Mon Sep 17 00:00:00 2001
>> From: Peter Xu <pet...@redhat.com>
>> Date: Fri, 15 Sep 2023 12:11:34 -0400
>> Subject: [PATCH] migration: Fix race that dest preempt thread close too early
>>
>> We hit an intermittent CI issue failing at migration-test over the unit
>> test preempt/plain:
>>
>>   qemu-system-x86_64: Unable to read from socket: Connection reset by peer
>>   Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
>>   **
>>   ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
>>   (test program exited with status code -6)
>>
>> Fabiano debugged into it and found that the preempt thread can quit even
>> without receiving all the pages, which can cause the guest not to receive
>> all the pages and corrupt the guest memory.
>>
>> To make sure the preempt thread finished receiving all the pages, we can
>> rely on page_requested_count being zero because the preempt channel will
>> only receive requested page faults. Note, not all the faulted pages are
>> required to be sent via the preempt channel/thread; imagine the case when
>> a requested page is just queued into the background main channel for
>> migration, the src qemu will still send it via the background channel.
>>
>> Here instead of spinning over reading the count, we add a condvar so the
>> main thread can wait on it if that unusual case happened, without burning
>> cpu for no good reason, even if the duration is short; spinning in this
>> rare case would probably be fine, but it's better to not do so.
>>
>> The condvar is only used when that special case is triggered.  Some memory
>> ordering trickery is needed (against the preempt thread status field) to
>> guarantee the wakeup cannot be missed, so the main thread will always get
>> a kick when that case triggers.
>>
>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
>> Debugged-by: Fabiano Rosas <faro...@suse.de>
>> Signed-off-by: Peter Xu <pet...@redhat.com>
>> ---
>>  migration/migration.h    | 13 ++++++++++++-
>>  migration/migration.c    |  1 +
>>  migration/postcopy-ram.c | 35 +++++++++++++++++++++++++++++++++++
>>  3 files changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index c390500604..cdaa10d515 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -196,7 +196,10 @@ struct MigrationIncomingState {
>>
>>      /* A tree of pages that we requested to the source VM */
>>      GTree *page_requested;
>> -    /* For debugging purpose only, but would be nice to keep */
>> +    /*
>> +     * For postcopy only, count the number of requested page faults that
>> +     * still haven't been resolved.
>> +     */
>>      int page_requested_count;
>>      /*
>>       * The mutex helps to maintain the requested pages that we sent to the
>> @@ -210,6 +213,14 @@ struct MigrationIncomingState {
>>       * contains valid information.
>>       */
>>      QemuMutex page_request_mutex;
>> +    /*
>> +     * If postcopy preempt is enabled, there is a chance that the main
>> +     * thread finished loading its data before the preempt channel has
>> +     * finished loading the urgent pages.
>> +     * If that happens, the two threads will use this condvar to
>> +     * synchronize, so the main thread will always wait until all pages
>> +     * are received.
>> +     */
>> +    QemuCond page_request_cond;
>>
>>      /*
>>       * Number of devices that have yet to approve switchover. When this reaches
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d61e572742..e86b3f7368 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -153,6 +153,7 @@ void migration_object_init(void)
>>      qemu_sem_init(&current_incoming->postcopy_qemufile_dst_done, 0);
>>
>>      qemu_mutex_init(&current_incoming->page_request_mutex);
>> +    qemu_cond_init(&current_incoming->page_request_cond);
>>      current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>>
>>      migration_object_check(current_migration, &error_fatal);
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 29aea9456d..dbf02cd3ed 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -599,6 +599,30 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>>      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>>          /* Notify the fast load thread to quit */
>>          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> +        /*
>> +         * Update preempt_thread_status before reading count.  Note: the
>> +         * mutex lock only provides ACQUIRE semantics, and it doesn't stop
>> +         * this write from being reordered after reading the count.
>> +         */
>> +        smp_mb();
>> +        /*
>> +         * It's possible that the preempt thread is still handling the last
>> +         * pages to arrive which were requested by guest page faults.
>> +         * Make sure nothing is left behind by waiting on the condvar if
>> +         * that unlikely case happened.
>> +         */
>> +        WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
>> +            if (qatomic_read(&mis->page_requested_count)) {
>> +                /*
>> +                 * It is guaranteed to receive a signal later, because the
>> +                 * count>0 now, so it's destined to be decreased to zero
>> +                 * very soon by the preempt thread.
>> +                 */
>> +                qemu_cond_wait(&mis->page_request_cond,
>> +                               &mis->page_request_mutex);
>> +            }
>> +        }
>> +        /* Notify the fast load thread to quit */
>>          if (mis->postcopy_qemufile_dst) {
>>              qemu_file_shutdown(mis->postcopy_qemufile_dst);
>>          }
>> @@ -1279,6 +1303,17 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>>              g_tree_remove(mis->page_requested, host_addr);
>>              mis->page_requested_count--;
>>              trace_postcopy_page_req_del(host_addr,
>>                                          mis->page_requested_count);
>> +            /* Order the update of count and read of preempt status */
>> +            smp_mb();
>> +            if (qatomic_read(&mis->preempt_thread_status) ==
>> +                PREEMPT_THREAD_QUIT) {
>> +                /*
>> +                 * This probably means the main thread is waiting for us.
>> +                 * Notify that we've finished receiving the last requested
>> +                 * page.
>> +                 */
>> +                qemu_cond_signal(&mis->page_request_cond);
>> +            }
>>          }
>>          qemu_mutex_unlock(&mis->page_request_mutex);
>>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
>> --
>> 2.41.0
>
> I just made some changes on top of this patch..
>
> One major thing is to double check we kick the main thread only if it's
> the last page, as I think there can be >1 pages on the fly..
>
> While at it, I cleaned up a bit on accessing either the preempt status
> field or the counter, to make it always consistent (never use atomic
> ops for the status as it's defined as volatile anyway; meanwhile always
> use qatomic for the counter).
>
> ---8<---
> From bfdd1f872fbeafe26ebad280a6d9e5a88de621de Mon Sep 17 00:00:00 2001
> From: Peter Xu <pet...@redhat.com>
> Date: Fri, 15 Sep 2023 12:45:01 -0400
> Subject: [PATCH] fixup!
>   migration: Fix race that dest preempt thread close too early
>
> Three more small changes:
>
> - Always use qatomic_* for page_requested_count accesses
> - Only kick the main thread when we're sure it's the last page
> - Drop qatomic_read() on reading preempt status, to match its use
>   elsewhere; not needed when it's defined as volatile already
>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> ---
>  migration/migration.c    | 2 +-
>  migration/postcopy-ram.c | 7 ++++---
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e86b3f7368..3ee1e6b0d6 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -368,7 +368,7 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>           * things like g_tree_lookup() will return TRUE (1) when found.
>           */
>          g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> -        mis->page_requested_count++;
> +        qatomic_inc(&mis->page_requested_count);
>          trace_postcopy_page_req_add(aligned, mis->page_requested_count);
>      }
>  }
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index dbf02cd3ed..5408e028c6 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1301,12 +1301,13 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>       */
>      if (g_tree_lookup(mis->page_requested, host_addr)) {
>          g_tree_remove(mis->page_requested, host_addr);
> -        mis->page_requested_count--;
> +        int left_pages = qatomic_dec_fetch(&mis->page_requested_count);
> +
>          trace_postcopy_page_req_del(host_addr,
>                                      mis->page_requested_count);
>          /* Order the update of count and read of preempt status */
>          smp_mb();
> -        if (qatomic_read(&mis->preempt_thread_status) ==
> -            PREEMPT_THREAD_QUIT) {
> +        if (mis->preempt_thread_status == PREEMPT_THREAD_QUIT &&
> +            left_pages == 0) {
>              /*
>               * This probably means the main thread is waiting for us.
>               * Notify that we've finished receiving the last requested
> --
> 2.41.0
This fixes the issue. It looks ok to me. Do you want me to send a separate patch with it?