Peter Xu <pet...@redhat.com> writes:

> On Fri, Sep 15, 2023 at 12:28:36PM -0400, Peter Xu wrote:
>> On Thu, Sep 14, 2023 at 10:56:23PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <pet...@redhat.com> writes:
>> >
>> > > On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
>> > >> Fabiano Rosas <faro...@suse.de> writes:
>> > >>
>> > >> > Peter Xu <pet...@redhat.com> writes:
>> > >> >
>> > >> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
>> > >> >>> I managed to reproduce it. It's not the return path error. In
>> > >> >>> hindsight that's obvious because that error happens in the
>> > >> >>> 'recovery' test and this one in the 'plain' one. Sorry about the
>> > >> >>> noise.
>> > >> >>
>> > >> >> No worry. It's good to finally identify that.
>> > >> >>
>> > >> >>>
>> > >> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
>> > >> >>> investigate.
>> > >> >
>> > >> > It seems that we're getting a tcp disconnect (ECONNRESET) when doing
>> > >> > that shutdown() on postcopy_qemufile_src. The one from commit
>> > >> > 6621883f93 ("migration: Fix potential race on postcopy_qemufile_src").
>> > >> >
>> > >> > I'm trying to determine why that happens when other times it just
>> > >> > returns 0 as expected.
>> > >> >
>> > >> > Could this mean that we're kicking the dest too soon while it is still
>> > >> > receiving valid data?
>> > >>
>> > >> Looking a bit more into this, what's happening is that
>> > >> postcopy_ram_incoming_cleanup() is shutting down the
>> > >> postcopy_qemufile_dst while ram_load_postcopy() is still running.
>> > >>
>> > >> The postcopy_ram_listen_thread() function waits for the
>> > >> main_thread_load_event, but that only works when not using preempt. With
>> > >> the preempt thread, the event is set right away and we proceed to do the
>> > >> cleanup without waiting.
>> > >>
>> > >> So the assumption of commit 6621883f93 that the incoming side knows when
>> > >> it has finished migrating is wrong IMO. Without the EOS we're relying on
>> > >> the chance that the shutdown() happens after the last recvmsg has
>> > >> returned and not during it.
>> > >>
>> > >> Peter, what do you think?
>> > >
>> > > That's a good point.
>> > >
>> > > One thing to verify that (sorry, I still cannot reproduce it myself,
>> > > which is so weird... it seems loads won't really help reproducing this)
>> > > is to let the main thread wait for all requested pages to arrive:
>> > >
>> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> > > index 29aea9456d..df055c51ea 100644
>> > > --- a/migration/postcopy-ram.c
>> > > +++ b/migration/postcopy-ram.c
>> > > @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>> > >      trace_postcopy_ram_incoming_cleanup_entry();
>> > >
>> > >      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>> > > +        /*
>> > > +         * NOTE! it's possible that the preempt thread is still handling
>> > > +         * the last pages to arrive which were requested by faults.  Making
>> > > +         * sure nothing is left behind.
>> > > +         */
>> > > +        while (qatomic_read(&mis->page_requested_count));
>> > > +
>> > >          /* Notify the fast load thread to quit */
>> > >          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> > >          if (mis->postcopy_qemufile_dst) {
>> > >
>> > > If that can work solidly, we can figure out a better way than a dead loop
>> > > here.
>> >
>> > Yep, 2000+ iterations so far and no error.
>>
>> Definitely good news.. thanks.
>>
>> >
>> > Should we add something that makes ram_load_postcopy return once it's
>> > finished? Then this code could just set PREEMPT_THREAD_QUIT and join the
>> > preempt thread.
>>
>> Yeah it'll be nice to have that.
>> We used to have an EOS message at the end for the preempt channel, but
>> that used to cause another race (which IIRC I could easily reproduce at
>> that time) and we removed it after 6621883f9398bc3 for 7.2+.
>>
>> Now instead of fiddling with the protocol again, we do have a way to detect
>> that, but it may need some trick on counting on page_requested_count and
>> also provide a signal mechanism.
>>
>> I've drafted one solution and attached it here as a complete patch, feel
>> free to try. If you have anything better, that'll be even nicer; just let
>> me know your thoughts.
>>
>> ---8<---
>> From 384aff2264650872e15916dcfeaec593e5e9b781 Mon Sep 17 00:00:00 2001
>> From: Peter Xu <pet...@redhat.com>
>> Date: Fri, 15 Sep 2023 12:11:34 -0400
>> Subject: [PATCH] migration: Fix race that dest preempt thread close too early
>>
>> We hit an intermittent CI issue failing at migration-test over the unit
>> test preempt/plain:
>>
>>   qemu-system-x86_64: Unable to read from socket: Connection reset by peer
>>   Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
>>   **
>>   ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
>>   (test program exited with status code -6)
>>
>> Fabiano debugged into it and found that the preempt thread can quit even
>> without receiving all the pages, which can cause the guest not to receive
>> all the pages and corrupt the guest memory.
>>
>> To make sure the preempt thread finished receiving all the pages, we can
>> rely on page_requested_count being zero because the preempt channel will
>> only receive requested page faults. Note, not all the faulted pages are
>> required to be sent via the preempt channel/thread; imagine the case when
>> a requested page is just queued into the background main channel for
>> migration, the src qemu will still send it via the background channel.
>>
>> Here instead of spinning over reading the count, we add a condvar so the
>> main thread can wait on it if that unusual case happened, without burning
>> cpu for no good reason, even if the duration is short; spinning in this
>> rare case would probably be fine, but it's better to not do so.
>>
>> The condvar is only used when that special case is triggered.  Some memory
>> ordering trickery is needed (against the preempt thread status field) to
>> guarantee the wakeup cannot be missed, so the main thread will always get
>> a kick when that case triggers.
>>
>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
>> Debugged-by: Fabiano Rosas <faro...@suse.de>
>> Signed-off-by: Peter Xu <pet...@redhat.com>
>> ---
>>  migration/migration.h    | 13 ++++++++++++-
>>  migration/migration.c    |  1 +
>>  migration/postcopy-ram.c | 35 +++++++++++++++++++++++++++++++++++
>>  3 files changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index c390500604..cdaa10d515 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -196,7 +196,10 @@ struct MigrationIncomingState {
>>
>>      /* A tree of pages that we requested to the source VM */
>>      GTree *page_requested;
>> -    /* For debugging purpose only, but would be nice to keep */
>> +    /*
>> +     * For postcopy only, count the number of requested page faults that
>> +     * still haven't been resolved.
>> +     */
>>      int page_requested_count;
>>      /*
>>       * The mutex helps to maintain the requested pages that we sent to the
>> @@ -210,6 +213,14 @@ struct MigrationIncomingState {
>>       * contains valid information.
>>       */
>>      QemuMutex page_request_mutex;
>> +    /*
>> +     * If postcopy preempt is enabled, there is a chance that the main
>> +     * thread finished loading its data before the preempt channel has
>> +     * finished loading the urgent pages.
>> +     * If that happens, the two threads will use this condvar to
>> +     * synchronize, so the main thread will always wait until all pages
>> +     * are received.
>> +     */
>> +    QemuCond page_request_cond;
>>
>>      /*
>>       * Number of devices that have yet to approve switchover. When this reaches
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d61e572742..e86b3f7368 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -153,6 +153,7 @@ void migration_object_init(void)
>>      qemu_sem_init(&current_incoming->postcopy_qemufile_dst_done, 0);
>>
>>      qemu_mutex_init(&current_incoming->page_request_mutex);
>> +    qemu_cond_init(&current_incoming->page_request_cond);
>>      current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>>
>>      migration_object_check(current_migration, &error_fatal);
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 29aea9456d..dbf02cd3ed 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -599,6 +599,30 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>>      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>>          /* Notify the fast load thread to quit */
>>          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> +        /*
>> +         * Update preempt_thread_status before reading count.  Note: the
>> +         * mutex lock only provides ACQUIRE semantics, and it doesn't stop
>> +         * this write from being reordered after reading the count.
>> +         */
>> +        smp_mb();
>> +        /*
>> +         * It's possible that the preempt thread is still handling the last
>> +         * pages to arrive which were requested by guest page faults.
>> +         * Make sure nothing is left behind by waiting on the condvar if
>> +         * that unlikely case happened.
>> +         */
>> +        WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
>> +            if (qatomic_read(&mis->page_requested_count)) {
>> +                /*
>> +                 * It is guaranteed to receive a signal later, because the
>> +                 * count>0 now, so it's destined to be decreased to zero
>> +                 * very soon by the preempt thread.
>> +                 */
>> +                qemu_cond_wait(&mis->page_request_cond,
>> +                               &mis->page_request_mutex);
>> +            }
>> +        }
>> +        /* Notify the fast load thread to quit */
>>          if (mis->postcopy_qemufile_dst) {
>>              qemu_file_shutdown(mis->postcopy_qemufile_dst);
>>          }
>> @@ -1279,6 +1303,17 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>>              g_tree_remove(mis->page_requested, host_addr);
>>              mis->page_requested_count--;
>>              trace_postcopy_page_req_del(host_addr,
>>                                          mis->page_requested_count);
>> +            /* Order the update of count and read of preempt status */
>> +            smp_mb();
>> +            if (qatomic_read(&mis->preempt_thread_status) ==
>> +                PREEMPT_THREAD_QUIT) {
>> +                /*
>> +                 * This probably means the main thread is waiting for us.
>> +                 * Notify that we've finished receiving the last requested
>> +                 * page.
>> +                 */
>> +                qemu_cond_signal(&mis->page_request_cond);
>> +            }
>>          }
>>          qemu_mutex_unlock(&mis->page_request_mutex);
>>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
>> --
>> 2.41.0
>
> I just made some changes on top of this patch..
>
> One major thing is to double check we kick the main thread only if it's
> the last page, as I think there can be >1 pages on the fly..
>
> While at it, I cleaned up a bit on accessing either the preempt status
> field or the counter, to make it always consistent (never use atomic
> ops for the status as it's defined as volatile anyway; meanwhile always
> use qatomic for the counter).
>
> ---8<---
> From bfdd1f872fbeafe26ebad280a6d9e5a88de621de Mon Sep 17 00:00:00 2001
> From: Peter Xu <pet...@redhat.com>
> Date: Fri, 15 Sep 2023 12:45:01 -0400
> Subject: [PATCH] fixup!
>   migration: Fix race that dest preempt thread close too early
>
> Three more small changes:
>
> - Always use qatomic_* for page_requested_count accesses
> - Only kick the main thread when we're sure it's the last page
> - Drop qatomic_read() on reading preempt status, to match its use
>   elsewhere; not needed when it's defined as volatile already
>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> ---
>  migration/migration.c    | 2 +-
>  migration/postcopy-ram.c | 7 ++++---
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e86b3f7368..3ee1e6b0d6 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -368,7 +368,7 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>           * things like g_tree_lookup() will return TRUE (1) when found.
>           */
>          g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> -        mis->page_requested_count++;
> +        qatomic_inc(&mis->page_requested_count);
>          trace_postcopy_page_req_add(aligned, mis->page_requested_count);
>      }
>  }
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index dbf02cd3ed..5408e028c6 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1301,12 +1301,13 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>       */
>      if (g_tree_lookup(mis->page_requested, host_addr)) {
>          g_tree_remove(mis->page_requested, host_addr);
> -        mis->page_requested_count--;
> +        int left_pages = qatomic_dec_fetch(&mis->page_requested_count);
> +
>          trace_postcopy_page_req_del(host_addr,
>                                      mis->page_requested_count);
>          /* Order the update of count and read of preempt status */
>          smp_mb();
> -        if (qatomic_read(&mis->preempt_thread_status) ==
> -            PREEMPT_THREAD_QUIT) {
> +        if (mis->preempt_thread_status == PREEMPT_THREAD_QUIT &&
> +            left_pages == 0) {
>              /*
>               * This probably means the main thread is waiting for us.
>               * Notify that we've finished receiving the last requested
> --
> 2.41.0
This fixes the issue. It looks ok to me. Do you want me to send a separate patch with it?