* Kevin Wolf (kw...@redhat.com) wrote:
> On 10.04.2018 at 09:36, Jiri Denemark wrote:
> > On Mon, Apr 09, 2018 at 15:40:03 +0200, Kevin Wolf wrote:
> > > On 09.04.2018 at 12:27, Dr. David Alan Gilbert wrote:
> > > > It's a fairly hairy failure case they had; if I remember correctly it's:
> > > >   a) Start migration
> > > >   b) Migration gets to completion point
> > > >   c) Destination is still paused
> > > >   d) Libvirt is restarted on the source
> > > >   e) Since libvirt was restarted it fails the migration (and hence knows
> > > >      the destination won't be started)
> > > >   f) It now tries to resume the qemu on the source
> > > > 
> > > > (f) fails because (b) caused the locks to be taken on the destination;
> > > > hence this patch stops doing that.  It's a case we don't really think
> > > > about - i.e. that the migration has actually completed and all the data
> > > > is on the destination, but libvirt decides for some other reason to
> > > > abandon migration.
> > > 
> > > If you do remember correctly, that scenario doesn't feel tricky at all.
> > > libvirt needs to quit the destination qemu, which will inactivate the
> > > images on the destination and release the lock, and then it can continue
> > > the source.
> > > 
> > > In fact, this is so straightforward that I wonder what else libvirt is
> > > doing. Is the destination qemu only shut down after trying to continue
> > > the source? That would be libvirt using the wrong order of steps.
> > 
> > There's no connection between the two libvirt daemons in the case we're
> > talking about so they can't really synchronize the actions. The
> > destination daemon will kill the new QEMU process and the source will
> > resume the old one, but the order is completely random.
> 
> Hm, okay...
> 
> > > > Yes it was a 'block-activate' that I'd wondered about.  One complication
> > > > is that if this is now under the control of the management layer then we
> > > > should stop asserting when the block devices aren't in the expected
> > > > state and just cleanly fail the command instead.
> > > 
> > > Requiring an explicit 'block-activate' on the destination would be an
> > > incompatible change, so you would have to introduce a new option for
> > > that. 'block-inactivate' on the source feels a bit simpler.
> > 
> > As I said in another email, the explicit block-activate command could
> > depend on a migration capability similarly to how pre-switchover state
> > works.
> 
> Yeah, that's exactly the thing that we wouldn't need if we could use
> 'block-inactivate' on the source instead. It feels a bit wrong to
> design a more involved QEMU interface around the libvirt internals,

It's not necessarily 'libvirt internals' - it's a case of them having to
cope with recovering from failures that happen around migration; it's
not an easy problem, and if they've got a way to stop both sides running
at the same time that's pretty important.

> but
> as long as we implement both sides for symmetry and libvirt just happens
> to pick the destination side for now, I think it's okay.
> 
> By the way, are block devices the only thing that needs to be explicitly
> activated? For example, what about qemu_announce_self() for network
> cards, do we need to delay that, too?
> 
> In any case, I think this patch needs to be reverted for 2.12 because
> it's wrong, and then we can create the proper solution in the 2.13
> timeframe.

What case does this break?
I'm a bit wary of reverting this, which fixes a known problem, on the
basis that it causes a theoretical problem.
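
For reference, the failure sequence (a)-(f) quoted above can be sketched as
a toy model. This is only an illustration under simplifying assumptions: the
names ImageLock, Qemu and recover are hypothetical and are not QEMU or
libvirt APIs, and the image lock is reduced to a single owner field. It
shows that when the destination takes the lock at completion, resuming the
source succeeds only if the destination happens to be killed first.

```python
# Toy model only: ImageLock, Qemu and recover are hypothetical names,
# not QEMU or libvirt APIs.

class ImageLock:
    """Exclusive lock on a shared disk image, held by at most one owner."""

    def __init__(self):
        self.owner = None

    def acquire(self, who):
        # Fail if somebody else already holds the lock.
        if self.owner not in (None, who):
            return False
        self.owner = who
        return True

    def release(self, who):
        if self.owner == who:
            self.owner = None


class Qemu:
    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.running = False

    def activate_and_run(self):
        """Take the image lock and run the guest; fail cleanly if locked."""
        if not self.lock.acquire(self.name):
            return False
        self.running = True
        return True

    def inactivate(self):
        """Stop the guest and give up the image lock."""
        self.running = False
        self.lock.release(self.name)

    def quit(self):
        self.inactivate()


def recover(order, lock_on_completion):
    """Replay steps (a)-(f); return True if the source resumed."""
    lock = ImageLock()
    src = Qemu("src", lock)
    dst = Qemu("dst", lock)
    src.activate_and_run()      # guest runs on the source before migration
    src.inactivate()            # (a)-(b): migration reaches completion
    if lock_on_completion:
        lock.acquire("dst")     # (b)-(c): destination holds the lock, paused
    resumed = False
    for step in order:          # (d)-(f): the two daemons act unsynchronized
        if step == "kill_dst":
            dst.quit()
        elif step == "resume_src":
            resumed = src.activate_and_run()
    return resumed
```

With lock_on_completion=True the result depends on the order of the two
steps; with lock_on_completion=False (the destination not taking the locks
until it is actually started) either order lets the source resume.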

Dave

> Kevin
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
