On Fri, Oct 07, 2016 at 08:38:05PM +0100, 'Viktor Bachraty' via ganeti-devel wrote:
> In some failure modes, Ganeti's state of record may desync from the actual
> state of the world. This patch allows migrate --cleanup to adopt an instance
> if it is detected running on an unexpected node.

I'm trying to imagine how this might happen, but can't come up with a case.
Can you give a brief example?

> Signed-off-by: Viktor Bachraty <vbachr...@google.com>
> ---
>  lib/cmdlib/instance_migration.py | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/cmdlib/instance_migration.py b/lib/cmdlib/instance_migration.py
> index 423a08b..cac0f5e 100644
> --- a/lib/cmdlib/instance_migration.py
> +++ b/lib/cmdlib/instance_migration.py
> @@ -611,8 +611,9 @@ class TLMigrateInstance(Tasklet):
>                       " hangs, the hypervisor might be in a bad state)")
>  
>      cluster_hvparams = self.cfg.GetClusterInfo().hvparams
> +    online_node_uuids = self.cfg.GetOnlineNodeList()
>      instance_list = self.rpc.call_instance_list(
> -        self.all_node_uuids, [self.instance.hypervisor], cluster_hvparams)
> +        online_node_uuids, [self.instance.hypervisor], cluster_hvparams)

I'm not sure if this is possible, but I thought I'd ask: since you're now
checking only online nodes rather than all nodes, what happens if a node goes
offline (because of a connectivity issue to the master, for example) while the
instance on it is still running? Are we going to end up with an orphaned,
still-running domain?

>      # Verify each result and raise an exception if failed
>      for node_uuid, result in instance_list.items():
> @@ -678,10 +679,16 @@ class TLMigrateInstance(Tasklet):
>                                 " and restart this operation")
>  
>      if not (runningon_source or runningon_target):
> -      raise errors.OpExecError("Instance does not seem to be running at all;"
> -                               " in this case it's safer to repair by"
> -                               " running 'gnt-instance stop' to ensure disk"
> -                               " shutdown, and then restarting it")
> +      if len(instance_locations) == 1:
> +        # The instance is running on a different node than expected, let's
> +        # adopt it as if it was running on the secondary
> +        self.target_node_uuid = instance_locations[0]

Cool. But we should probably report this case with self.feedback_fn.
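Something like this might do (untested sketch; I'm assuming self.cfg.GetNodeName
is available here to turn the node UUID into a readable name):

    self.feedback_fn("Instance %s found running on unexpected node %s;"
                     " adopting it as the new target node" %
                     (self.instance.name,
                      self.cfg.GetNodeName(instance_locations[0])))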

> +        runningon_target = True
> +      else:
> +        raise errors.OpExecError("Instance does not seem to be running at all;"
> +                                 " in this case it's safer to repair by"
> +                                 " running 'gnt-instance stop' to ensure disk"
> +                                 " shutdown, and then restarting it")
>  
>      if runningon_target:
>        # the migration has actually succeeded, we need to update the config
> -- 
> 2.8.0.rc3.226.g39d4020
> 
