On Wed, Aug 02, 2017 at 01:56:46PM +0800, Peter Xu wrote:
> On Tue, Aug 01, 2017 at 12:03:48PM +0100, Daniel P. Berrange wrote:
> > On Fri, Jul 28, 2017 at 04:06:25PM +0800, Peter Xu wrote:
> > > It will be used when we want to resume one paused migration.
> > > 
> > > Signed-off-by: Peter Xu <pet...@redhat.com>
> > > ---
> > >  hmp-commands.hx       | 7 ++++---
> > >  hmp.c                 | 4 +++-
> > >  migration/migration.c | 2 +-
> > >  qapi-schema.json      | 5 ++++-
> > >  4 files changed, 12 insertions(+), 6 deletions(-)
> > 
> > I'm not seeing explicit info about how we handle the original failure
> > and how it relates to this resume command, but this feels like a
> > potentially racy approach to me.
> > 
> > If we have a network problem between source & target, we could see
> > two results. Either the TCP stream will simply hang (it'll still
> > appear open to QEMU but no traffic will be flowing),
> 
> (let's say this is the "1st condition")
> 
> > or the connection
> > may actually break such that we get EOF and end up closing the file
> > descriptor.
> 
> (let's say this is the "2nd condition")
> 
> > 
> > In the latter case, we're ok because the original channel is now
> > gone and we can safely establish the new one by issuing the new
> > 'migrate --resume URI' command.
> > 
> > In the former case, however, there is the possibility that the
> > hang may come back to life at some point, concurrently with us
> > trying to do 'migrate --resume URI' and I'm unclear on the
> > semantics if that happens.
> > 
> > Should the original connection carry on, and thus cause the
> > 'migrate --resume' command to fail, or will we forcibly terminate
> > the original connection no matter what and use the new "resumed"
> > connection?
> 
> Hmm yes, this is a good question. Currently this series only handles
> the 2nd condition, i.e. when we can detect the error via system
> calls (IIUC we can know nothing when the 1st condition is
> encountered; we just block in the system calls as usual, e.g. when
> reading the file handle). And currently the "resume" command is only
> allowed if the 2nd condition is detected (so it will never destroy
> an existing channel).
> 
> If you look at the next patch in the series, there is something like:
> 
>     if (has_resume && resume) {
>         if (s->state != MIGRATION_STATUS_POSTCOPY_PAUSED) {
>             error_setg(errp, "Cannot resume if there is no "
>                        "paused migration");
>             return;
>         }
>         goto do_resume;
>     }
> 
> And here MIGRATION_STATUS_POSTCOPY_PAUSED will only be set when the
> 2nd condition is met.
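
Conceptually, the source side then does something like this (a
paraphrase for illustration, not the literal code from the series):

    /* In the migration thread, once a send/recv error is detected
     * during postcopy (the "2nd condition"): */
    if (qemu_file_get_error(s->to_dst_file)) {
        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                          MIGRATION_STATUS_POSTCOPY_PAUSED);
        /* ...park here until 'migrate --resume URI' hands over a new
         * channel, then flip back to POSTCOPY_ACTIVE... */
    }
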
> 
> > 
> > There's also synchronization with the target host - at the time we
> > want to recover, we need to be able to tell the target to accept
> > new incoming clients again, but we don't want to do that if the
> > original connection comes back to life.
> 
> Yeah, I hacked this part in this v1 series (as you may have seen) to
> keep the ports open forever. I am not sure whether that is
> acceptable, but it looks like it is not. :)
> 
> How about this: when the destination detects the 2nd condition, it
> first switches to the "postcopy-pause" state, then re-opens the
> accept channels. It can then turn the accept channels off when the
> state moves out of "postcopy-pause".
> 
> > 
> > It feels to me that if the mgmt app or admin believes the migration
> > is in a stuck state, we should be able to explicitly terminate the
> > existing connection via a monitor command. Then setup the target
> > host to accept new client, and then issue this migrate resume on
> > the source.
> 
> Totally agree. That should be the only way to handle the 1st
> condition well. However, would you mind if I postpone it a bit?
> IMHO as long as we can solve the 2nd condition nicely (which is the
> goal of this series), it won't be too hard to go on to support the
> 1st condition.

Sure, the 1st scenario is an easy bolt-on to the 2nd scenario. I
just wanted to be clear about what the target of these patches is,
because I think the 1st scenario is probably the most common one.
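
When we do get to it, I'd expect the forced-terminate path to be
little more than a shutdown() on the socket, so the blocked I/O
fails immediately and the existing 2nd-condition handling takes
over. Roughly (an illustrative sketch; the command name is made up):

    /* Hypothetical "force pause" monitor command on the source.
     * qemu_file_shutdown() shuts down the underlying socket, so the
     * migration thread's blocked read/write returns an error and the
     * normal 2nd-condition handling kicks in. */
    void qmp_migrate_force_pause(Error **errp)
    {
        MigrationState *s = migrate_get_current();

        if (s->to_dst_file) {
            qemu_file_shutdown(s->to_dst_file);
        }
    }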

I guess if you have TCP keepalives enabled with a reasonably short
timeout, the 1st scenario will turn into the 2nd scenario fairly
quickly.
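
For example, on Linux something along these lines on the migration
socket would detect a dead peer in roughly 15 seconds (illustrative
values; not something QEMU sets today AFAIK):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Probe after 5s idle, retry every 5s, give up after 2 failed
     * probes; the blocked read/write then fails with ETIMEDOUT. */
    static int enable_short_keepalive(int fd)
    {
        int on = 1, idle = 5, intvl = 5, cnt = 2;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
            return -1;
        }
        return 0;
    }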

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
