On Fri, Oct 07, 2016 at 08:38:05PM +0100, 'Viktor Bachraty' via ganeti-devel wrote:
> In some failure modes, Ganeti state of record may desync from actual
> state of world. This patch allows migrate --cleanup to adopt an instance
> if it is detected running on an unexpected node.
I'm trying to imagine how this might happen, but can't think how. Can you give
a brief example?
> Signed-off-by: Viktor Bachraty <vbachr...@google.com>
> lib/cmdlib/instance_migration.py | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
> diff --git a/lib/cmdlib/instance_migration.py
> index 423a08b..cac0f5e 100644
> --- a/lib/cmdlib/instance_migration.py
> +++ b/lib/cmdlib/instance_migration.py
> @@ -611,8 +611,9 @@ class TLMigrateInstance(Tasklet):
> " hangs, the hypervisor might be in a bad state)")
> cluster_hvparams = self.cfg.GetClusterInfo().hvparams
> + online_node_uuids = self.cfg.GetOnlineNodeList()
> instance_list = self.rpc.call_instance_list(
> - self.all_node_uuids, [self.instance.hypervisor], cluster_hvparams)
> + online_node_uuids, [self.instance.hypervisor], cluster_hvparams)
I'm not sure if this is possible, but I thought I'd ask: since you're only
checking online nodes rather than all nodes, what happens if a node goes offline
(because of a connectivity issue with the master, for example) while the
instance on it is still running? Are we going to end up with an orphaned,
still-running instance?
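To make the concern concrete, here's a toy model (all names invented for illustration, nothing here is Ganeti API): querying only the online node list simply cannot see an instance that is still running on a node that just dropped offline.

```python
# Toy model of the orphaned-instance concern (illustrative names only).

def find_instance(node_to_instances, queried_nodes, instance):
    """Return the subset of queried_nodes that report the instance running."""
    return [n for n in queried_nodes
            if instance in node_to_instances.get(n, [])]

# Actual state of the world: inst1 runs on node2, which just went offline.
world = {"node1": [], "node2": ["inst1"], "node3": []}
online = ["node1", "node3"]

print(find_instance(world, online, "inst1"))       # [] -> looks stopped
print(find_instance(world, list(world), "inst1"))  # ['node2'] -> still running
```

With only the online view, the cleanup path would conclude the instance is down even though it is still running on the offline node.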
> # Verify each result and raise an exception if failed
> for node_uuid, result in instance_list.items():
> @@ -678,10 +679,16 @@ class TLMigrateInstance(Tasklet):
> " and restart this operation")
> if not (runningon_source or runningon_target):
> - raise errors.OpExecError("Instance does not seem to be running at all;"
> - " in this case it's safer to repair by"
> - " running 'gnt-instance stop' to ensure disk"
> - " shutdown, and then restarting it")
> + if len(instance_locations) == 1:
> +      # The instance is running on a different node than expected, let's
> + # adopt it as if it was running on the secondary
> +      self.target_node_uuid = instance_locations[0]
Cool. But we should probably report this case with self.feedback_fn.
> + runningon_target = True
> + else:
> +      raise errors.OpExecError("Instance does not seem to be running at all;"
> +                               " in this case it's safer to repair by"
> +                               " running 'gnt-instance stop' to ensure disk"
> +                               " shutdown, and then restarting it")
> if runningon_target:
> # the migration has actually succeeded, we need to update the config