On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote:
> On 4/2/2024 0:13, Peter Xu wrote:
> > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> >>> When using the post-copy preemption feature to perform post-copy
> >>> live migration, the below scenario could lead to a deadlock and the
> >>> migration will never finish:
> >>>
> >>>  - Source connect() the preemption channel in postcopy_start().
> >>>  - Source and the destination side TCP stack finished the 3-way handshake
> >>>    thus the connection is successful.
> >>>  - The destination side main thread is handling the loading of the bulk
> >>>    RAM pages, thus it doesn't start to handle the pending connection event
> >>>    in the event loop and doesn't post the semaphore
> >>>    postcopy_qemufile_dst_done for the preemption thread.
> >>>  - The source side sends non-iterative device states, such as the virtio
> >>>    states.
> >>>  - The destination main thread starts to receive the virtio states; this
> >>>    process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
> >>>    may trigger a page fault since the avail ring page may not be received
> >>>    yet).
> >
> > Ouch.  Yeah I think this part got overlooked when working on the
> > preempt channel.
> >
> >>>  - The page request is sent back to the source side. Source sends the page
> >>>    content to the destination side preemption thread.
> >>>  - Since the event has not arrived and the semaphore
> >>>    postcopy_qemufile_dst_done is not posted, the preemption thread on the
> >>>    destination side is blocked and cannot handle receiving the page.
> >>>  - The QEMU main load thread on the destination side is stuck at the page
> >>>    fault, and cannot yield and handle the connect() event for the
> >>>    preemption channel to unblock the preemption thread.
> >>>  - The postcopy migration will be stuck there forever since this is a deadlock.
> >>>
> >>> The key point in reproducing this bug is that the source side sends
> >>> pages at a rate faster than the destination can handle them;
> >>> otherwise, the qemu_get_be64() in ram_load_precopy() will have a
> >>> chance to yield since at that time there is no pending data in the
> >>> buffer to get, which makes this bug harder to reproduce.
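(Side note on the "chance to yield" above: on the destination, ram_load_precopy()
runs in the incoming-migration coroutine, and the yield happens inside the QEMUFile
refill path (qemu_fill_buffer() in migration/qemu-file.c) when the channel has no
data pending. Below is a simplified paraphrase of that path, not the verbatim
upstream code; names mirror the real helpers:)

    /* Simplified sketch of how a coroutine-context read on a QIOChannel
     * yields when no data is pending.  Paraphrased from the QEMUFile
     * refill logic; details may differ by QEMU version. */
    static ssize_t read_or_yield(QIOChannel *ioc, char *buf, size_t len)
    {
        ssize_t ret;

        do {
            ret = qio_channel_read(ioc, buf, len, NULL);
            if (ret == QIO_CHANNEL_ERR_BLOCK) {
                if (qemu_in_coroutine()) {
                    /* Yield back to the main loop until the fd is readable;
                     * this is the window in which other events (e.g. the
                     * preempt channel's accept()) can be dispatched. */
                    qio_channel_yield(ioc, G_IO_IN);
                } else {
                    qio_channel_wait(ioc, G_IO_IN);
                }
            }
        } while (ret == QIO_CHANNEL_ERR_BLOCK);

        return ret;
    }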
> >
> > How hard would this reproduce?
> 
> We can make this easier to reproduce by manually adding the following code
> to make the destination busier while loading the pages:
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 0ad9fbba48..0b42877e1f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
> +    volatile unsigned long long a;
> 
>      if (!migrate_compress()) {
>          invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
> @@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f)
>              break;
> 
>          case RAM_SAVE_FLAG_PAGE:
> +            for (a = 0; a < 100000000; a++);
>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>              break;
> 

Which version of QEMU are you using?
I tried with the latest upstream QEMU (e.g. the v8.2.0 release, 1600b9f46b1bd), and it's
always reproducible without any changes (with local migration tests).


> >
> > I'm thinking whether this should be 9.0 material or 9.1.  It's pretty
> > late for 9.0, though we can still discuss.
> >
> >>>
> >>> Fix this by yielding the load coroutine when receiving
> >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> >>> connection event before loading the non-iterative device state,
> >>> avoiding the deadlock condition.
> >>>
> >>> Signed-off-by: Lei Wang <lei4.w...@intel.com>
> >>
> >> This seems to be a regression issue caused by this commit:
> >> 737840e2c6ea (migration: Use the number of transferred bytes
> >> directly)
> >>
> >> Adding qemu_fflush back to migration_rate_exceeded() or
> >> ram_save_iterate seems to work (might not be a good fix though).
> >>
> >>> ---
> >>>  migration/savevm.c | 5 +++++
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/migration/savevm.c b/migration/savevm.c
> >>> index e386c5267f..8fd4dc92f2 100644
> >>> --- a/migration/savevm.c
> >>> +++ b/migration/savevm.c
> >>> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> >>>          return loadvm_postcopy_handle_advise(mis, len);
> >>>
> >>>      case MIG_CMD_POSTCOPY_LISTEN:
> >>> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> >>> +            aio_co_schedule(qemu_get_current_aio_context(),
> >>> +                            qemu_coroutine_self());
> >>> +            qemu_coroutine_yield();
> >>> +        }
> >>
> >> The above could be moved to loadvm_postcopy_handle_listen().
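(For illustration, the move Wei suggests could look roughly like the sketch below;
the exact placement at the top of loadvm_postcopy_handle_listen() is an assumption
on my part, not a tested patch:)

    static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
    {
        /* Sketch: yield once before the listen handling so the main loop
         * can dispatch the pending accept() of the preempt channel. */
        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }

        /* ... existing listen handling continues unchanged ... */
    }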
> >
> > I'm not 100% sure such a thing (no matter here or moved into it, which
> > does look cleaner) would work for us.
> >
> > The problem is I still don't see an ordering restriction between (1)
> > accept() happening, and (2) receiving the LISTEN cmd here.  What happens
> > if the accept() request has not yet been received when reaching LISTEN?
> > Or is it always guaranteed that the accept(fd) will be polled here?
> >
> > For example, the source QEMU (no matter pre-7.2 or later) will always
> > set up the preempt channel asynchronously, then IIUC it can connect()
> > after sending the whole chunk of packed data which should include this
> > LISTEN.  I think it means it's not guaranteed this will 100% work, but
> > it may further reduce the possibility of the race.
> 
> I think the following code:
> 
> postcopy_start() ->
>       postcopy_preempt_establish_channel() ->
>               qemu_sem_wait(&s->postcopy_qemufile_src_sem);
> 
> can guarantee that the connect() syscall is successful, so the destination side
> receives the connect() request before it loads the LISTEN command; otherwise
> it won't post the sem:
> 
> postcopy_preempt_send_channel_new() ->
>       postcopy_preempt_send_channel_done() ->
>               qemu_sem_post(&s->postcopy_qemufile_src_sem);
> 

Yes. But as mentioned in another thread, connect() and accept() are async.
So in theory accept() could still happen after the LISTEN command.
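To make the async point concrete, here is a minimal standalone demo (plain POSIX
code, nothing QEMU specific; the port number is arbitrary and error handling is
omitted for brevity) showing that connect() can succeed while the listener has not
called accept() yet:

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(12345) };
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 16);            /* listener exists, but never accept()s */

        int csock = socket(AF_INET, SOCK_STREAM, 0);
        int ret = connect(csock, (struct sockaddr *)&addr, sizeof(addr));

        /* The handshake completes into the listen backlog, so connect()
         * returns 0 here even though accept() was never called. */
        printf("connect() returned %d with no accept() on the other side\n", ret);

        close(csock);
        close(lsock);
        return 0;
    }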

> >
> > One proper fix that I can think of is moving the sem_wait(&done) into
> > the main thread too, so we wait for the sem _before_ reading the
> > packed data and there's no chance of a fault.  However I don't think
> > sem_wait() will be smart enough to yield when in a coroutine.  In the
> > long run I think we should really make migration loadvm do its work in
> > a thread rather than the main thread.  I think it means we have one
> > more example to be listed in this todo, so that's preferred:
> >
> > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination
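(A hypothetical way around the "sem_wait() won't yield" problem, purely as a sketch:
the helper name below is my own and it busy-polls between yields, so it only
illustrates the idea and is not existing QEMU code or a real fix:)

    /* Hypothetical helper, not existing QEMU code: wait for a semaphore
     * from coroutine context without blocking the main loop, by yielding
     * between non-blocking semaphore checks. */
    static void coroutine_fn co_sem_wait(QemuSemaphore *sem)
    {
        while (qemu_sem_timedwait(sem, 0) != 0) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }
    }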
> >
> > I attached such a draft patch below, but I'm not sure it'll work.  Let
> > me know what both of you think about it.
> 
> Sadly it doesn't work; there is an unknown segfault.
> 
> >
> >>
> >> Another option is to follow the old way (i.e. pre_7_2) of doing
> >> postcopy_preempt_setup in migrate_fd_connect. This can save the above
> >> overhead of switching to the main thread during the downtime. It seems
> >> Peter's previous patch already solved the channel disordering issue.
> >> Let's see Peter's and others' opinions.
> >
> > IIUC we still need that pre_7_2 stuff and keep the postponed connect()
> > to make sure the ordering is done properly.  Wei, could you elaborate on
> > the patch you mentioned?  Maybe I missed some spots.
> >
> > You raised a good point that this may introduce higher downtime.  Did
> > you or Lei try to measure how large it is?  If it is too high, we may
> > need to think of another solution, e.g., waiting for the channel
> > connection before the vm stop happens.
> 
> Per my very simple test, using post-copy preemption to live migrate an 8G VM:
> 
>     w/o this patch: 121ms on average over 5 tries
>     w/ this patch: 115ms on average over 5 tries
> 
> So it seems the overhead introduced is not too high (maybe negligible?).

You could just measure the time for the added qemu_coroutine_yield() part.
The time will depend on how many events happen to be waiting for dispatch
at that point.
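For example, something like the sketch below could be dropped in temporarily
(assuming the yield block from the patch above; the stderr print and the use of
qemu_clock_get_us() are just for a quick measurement, not something to merge):

    if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
        /* Quick-and-dirty measurement of just the added yield. */
        int64_t start_us = qemu_clock_get_us(QEMU_CLOCK_REALTIME);

        aio_co_schedule(qemu_get_current_aio_context(),
                        qemu_coroutine_self());
        qemu_coroutine_yield();

        fprintf(stderr, "postcopy listen yield took %" PRId64 " us\n",
                qemu_clock_get_us(QEMU_CLOCK_REALTIME) - start_us);
    }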
