On Fri, Oct 07, 2016 at 08:38:03PM +0100, 'Viktor Bachraty' via ganeti-devel 
> In case of Xen migrations, the most common failure case is when the
> instance fails to freeze so the migration fails with domains running on
> both target and source node. This patchs allows migrate --cleanup to
> recover by running AbortMigrate() in case the instance is running on
> both the source and target node.

I look forward to never having to clean up after this again. However there's
a problem with this patch: the leaky abstraction doesn't deal with differnces
between xm and xl. See below.

> +      list of strings, node uuids
> +    """
> +    self.feedback_fn("* checking where the instance actually runs  (if this"
> +                     " hangs, the hypervisor might be in a bad state)")

Whitespace. s/runs  (/runs (/

> +    # Verify each result and raise an exception if failed
> +    for node_uuid, result in instance_list.items():
> +      result.Raise("Can't contact node %s" % node_uuid)

I know this is copy & pasted from the original, but would it be more useful for
this error to report self.cfg.GetNodeName(node_uuid)?

> +    # Xen renames the instance during migration, unfortunately we don't have
> +    # a nicer way of identifying that it's the same instance. This is an 
> awful
> +    # leaking abstraction.
> +    tname = 'migrating-' + name

This is partly true, but unfortunately this leaky abstraction is a little
broken. xm and xl handle this a bit differently:

xm behaviour: (in tools/python/xen/xend/XendCheckpoint.py save() & restore())
source: during:   migrating-somedomain
        finalize: <none>
        done:     <none>

target: during:   somedomain
        finalize: somedomain
        done:     somedomain

xl behaviour: (in tools/libxl/xl_cmdimpl.c migrate_domain() & migrate_receive())
source: copy:     somedomain
        finalize: somedomain--migratedaway
        done:     <none>

target: copy:     somedomain--incoming
        finalize: somedomain
        done:     somedomain

NB this doesn't affect patch 3/7's 'rename migrating-' instances fix in
FinalizeMigrationSource, because there won't be any, and it'll be a no-op.
However, you should probably update the comment to say that it only applies to
xm, and I think there might be an analogous problem we might want to handle if
xl fails to clean up the --migratedaway instance on the source node.

Reply via email to