On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote:
> On 4/2/2024 0:13, Peter Xu wrote:
> > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> >>> When using the post-copy preemption feature to perform post-copy
> >>> live migration, the below scenario could lead to a deadlock and the
> >>> migration will never finish:
> >>>
> >>> - Source connect()s the preemption channel in postcopy_start().
> >>> - The source and destination side TCP stacks finish the 3-way
> >>>   handshake, thus the connection is successful.
> >>> - The destination side main thread is handling the loading of the
> >>>   bulk RAM pages, thus it doesn't start to handle the pending
> >>>   connection event in the event loop, and doesn't post the semaphore
> >>>   postcopy_qemufile_dst_done for the preemption thread.
> >>> - The source side sends non-iterative device states, such as the
> >>>   virtio states.
> >>> - The destination main thread starts to receive the virtio states;
> >>>   this process may lead to a page fault (e.g.,
> >>>   virtio_load()->vring_avail_idx() may trigger a page fault since the
> >>>   avail ring page may not be received yet).
> >
> > Ouch.  Yeah I think this part got overlooked when working on the
> > preempt channel.
> >
> >>> - The page request is sent back to the source side. Source sends the
> >>>   page content to the destination side preemption thread.
> >>> - Since the event has not arrived and the semaphore
> >>>   postcopy_qemufile_dst_done is not posted, the preemption thread on
> >>>   the destination side is blocked, and cannot handle receiving the
> >>>   page.
> >>> - The QEMU main load thread on the destination side is stuck at the
> >>>   page fault, and cannot yield and handle the connect() event for the
> >>>   preemption channel to unblock the preemption thread.
> >>> - The postcopy will be stuck there forever since this is a deadlock.
> >>>
> >>> The key point to reproduce this bug is that the source side is
> >>> sending pages at a rate faster than the destination can handle them;
> >>> otherwise, the qemu_get_be64() in ram_load_precopy() will have a
> >>> chance to yield since at that time there is no pending data in the
> >>> buffer to get. This makes the bug harder to reproduce.
> >
> > How hard would this reproduce?
>
> We can manually make this easier to reproduce by adding the following
> code to make the destination busier when loading the pages:
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 0ad9fbba48..0b42877e1f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
> +    volatile unsigned long long a;
>
>      if (!migrate_compress()) {
>          invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
> @@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f)
>              break;
>
>          case RAM_SAVE_FLAG_PAGE:
> +            for (a = 0; a < 100000000; a++);
>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>              break;

Which version of QEMU are you using? I tried with the latest upstream QEMU
(e.g. the v8.2.0 release, commit 1600b9f46b1bd), and it's always
reproducible without any changes (with local migration tests).
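Stepping back to the scenario in the commit message: the circular wait can
be shown in isolation with a toy pthread model. This is only a minimal
sketch, not QEMU code; the thread names and semaphores merely mirror the
roles described above (main load thread, preempt thread,
postcopy_qemufile_dst_done), and the guest page fault is simulated by a
semaphore wait:

    /* Toy model of the reported cycle -- NOT QEMU code.  Build: cc -pthread */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static sem_t channel_ready;   /* stands in for postcopy_qemufile_dst_done */
    static sem_t page_resolved;   /* stands in for the faulted page arriving  */

    /* Preempt thread: blocked until the main loop posts channel_ready. */
    static void *preempt_thread(void *arg)
    {
        (void)arg;
        sem_wait(&channel_ready);
        sem_post(&page_resolved);        /* would resolve the page fault */
        return NULL;
    }

    /* Watchdog: the only thread that ever makes progress. */
    static void *watchdog(void *arg)
    {
        (void)arg;
        sleep(3);
        printf("deadlock: main waits for the page, preempt waits for the channel\n");
        exit(0);
    }

    int main(void)
    {
        pthread_t pt, wd;

        sem_init(&channel_ready, 0, 0);
        sem_init(&page_resolved, 0, 0);
        pthread_create(&pt, NULL, preempt_thread, NULL);
        pthread_create(&wd, NULL, watchdog, NULL);

        /* Main "load" thread: hits the simulated page fault before it ever
         * returns to its event loop, so channel_ready is never posted. */
        sem_wait(&page_resolved);

        /* Never reached -- this is where the event loop would have
         * dispatched the pending connection and posted channel_ready. */
        sem_post(&channel_ready);
        return 0;
    }

Each side waits for a post that only the other, already blocked, side would
make, so only the watchdog ever fires.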
> > I'm thinking whether this should be 9.0 material or 9.1.  It's pretty
> > late for 9.0 though, but we can still discuss.
> >
> >>>
> >>> Fix this by yielding the load coroutine when receiving
> >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> >>> connection event before loading the non-iterative device state to
> >>> avoid the deadlock condition.
> >>>
> >>> Signed-off-by: Lei Wang <lei4.w...@intel.com>
> >>
> >> This seems to be a regression issue caused by this commit:
> >> 737840e2c6ea (migration: Use the number of transferred bytes directly)
> >>
> >> Adding qemu_fflush back to migration_rate_exceeded() or
> >> ram_save_iterate seems to work (might not be a good fix though).
> >>
> >>> ---
> >>>  migration/savevm.c | 5 +++++
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/migration/savevm.c b/migration/savevm.c
> >>> index e386c5267f..8fd4dc92f2 100644
> >>> --- a/migration/savevm.c
> >>> +++ b/migration/savevm.c
> >>> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> >>>          return loadvm_postcopy_handle_advise(mis, len);
> >>>
> >>>      case MIG_CMD_POSTCOPY_LISTEN:
> >>> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> >>> +            aio_co_schedule(qemu_get_current_aio_context(),
> >>> +                            qemu_coroutine_self());
> >>> +            qemu_coroutine_yield();
> >>> +        }
> >>
> >> The above could be moved to loadvm_postcopy_handle_listen().
> >
> > I'm not 100% sure such a thing (no matter here or moved into it, which
> > does look cleaner) would work for us.
> >
> > The problem is I still don't see an ordering restriction between (1)
> > accept() happening, and (2) receiving the LISTEN cmd here.  What
> > happens if the accept() request is not yet received when reaching
> > LISTEN?  Or is it always guaranteed the accept(fd) will always be
> > polled here?
> >
> > For example, the source QEMU (no matter pre-7.2 or later) will always
> > set up the preempt channel asynchronously, then IIUC it can connect()
> > after sending the whole chunk of packed data which should include this
> > LISTEN.  I think it means it's not guaranteed this will 100% work, but
> > maybe it further reduces the possibility of the race.
>
> I think the following code:
>
>     postcopy_start() ->
>         postcopy_preempt_establish_channel() ->
>             qemu_sem_wait(&s->postcopy_qemufile_src_sem);
>
> can guarantee that the connect() syscall is successful so the destination
> side receives the connect() request before it loads the LISTEN command,
> otherwise it won't post the sem:
>
>     postcopy_preempt_send_channel_new() ->
>         postcopy_preempt_send_channel_done() ->
>             qemu_sem_post(&s->postcopy_qemufile_src_sem);

Yes. But as mentioned in another thread, connect() and accept() are async.
So in theory accept() could still come later, after the LISTEN command.
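To illustrate why: a successful connect() only means the destination kernel
completed the 3-way handshake into the listen backlog; it says nothing
about when the destination process gets around to accept(). A minimal
standalone sketch (not QEMU code, loopback only, error handling omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        /* "Destination": listen but never call accept(). */
        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                     /* kernel picks a free port */
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 1);
        getsockname(lsock, (struct sockaddr *)&addr, &len);

        /* "Source": the handshake is completed by the kernel backlog, so
         * connect() succeeds although nobody has accept()ed yet. */
        int csock = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(csock, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            printf("connect() succeeded; peer still hasn't called accept()\n");
        }

        close(csock);
        close(lsock);
        return 0;
    }

So even with the qemu_sem_wait() above on the source side, the
destination's accept()/event dispatch can still trail the LISTEN command.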
> > One right fix that I can think of is moving the sem_wait(&done) into
> > the main thread too, so we wait for the sem _before_ reading the
> > packed data, so there's no chance of fault.  However I don't think
> > sem_wait() will be smart enough to yield when in a coroutine..  In the
> > long run I think we should really make migration loadvm do its work in
> > a thread rather than the main thread.  I think it means we have one
> > more example to be listed in this todo, so that's preferred..
> >
> > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination
> >
> > I attached such a draft patch below, but I'm not sure it'll work.  Let
> > me know what both of you think about it.
>
> Sadly it doesn't work; there is an unknown segfault.
>
> >
> >>
> >> Another option is to follow the old way (i.e. pre_7_2) to do
> >> postcopy_preempt_setup in migrate_fd_connect. This can save the above
> >> overhead of switching to the main thread during the downtime. Seems
> >> Peter's previous patch already solved the channel disordering issue.
> >> Let's see Peter's and others' opinions.
> >
> > IIUC we still need that pre_7_2 stuff and keep the postponed connect()
> > to make sure the ordering is done properly.  Wei, could you elaborate
> > on the patch you mentioned?  Maybe I missed some spots.
> >
> > You raised a good point that this may introduce higher downtime.  Did
> > you or Lei try to measure how large it is?  If that is too high, we
> > may need to think of another solution, e.g., waiting for the channel
> > connection before the vm stop happens.
>
> Per my very simple test, using post-copy preemption to live migrate an
> 8G VM:
>
> w/o this patch: 121ms on average over 5 tries
> w/ this patch: 115ms on average over 5 tries
>
> So it seems the overhead introduced is not too high (maybe negligible?).

You could just measure the time for the added qemu_coroutine_yield() part.
The time will depend on how many events happen to be there waiting for a
dispatch.
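If it helps, a rough way to get that number is to timestamp around the
added yield in the hunk above. This is an untested sketch, assuming
qemu_clock_get_ns() and error_report() are fine to call at that point in
loadvm_process_command():

        case MIG_CMD_POSTCOPY_LISTEN:
            if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
                /* hypothetical instrumentation, only to measure the yield */
                int64_t t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

                aio_co_schedule(qemu_get_current_aio_context(),
                                qemu_coroutine_self());
                qemu_coroutine_yield();
                error_report("POSTCOPY_LISTEN yield: %" PRId64 " ns",
                             qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0);
            }

That would show directly how much of the measured downtime comes from
waiting for the pending events to be dispatched.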