[
https://issues.apache.org/jira/browse/CLOUDSTACK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029099#comment-14029099
]
Koushik Das commented on CLOUDSTACK-6857:
-----------------------------------------
Can you share the full logs? Based on the log snippet, none of the available
investigators were able to determine whether the VM is alive. In that case,
components called 'fencers' try to fence off the VM. If the fencers also fail,
nothing is done to the VM. The full logs will help in understanding everything
that happened.
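To make the flow above concrete, here is a minimal sketch of the investigator/fencer sequence described in the comment. The interfaces, class names, and return conventions are hypothetical illustrations, not CloudStack's actual API; the only behavior taken from the source is that each investigator may answer "alive", "down", or "cannot determine" (the `null` in the logs), and that fencers run only when no investigator can decide.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical interfaces for illustration only (not CloudStack's classes).
interface Investigator {
    // Optional.empty() models "unable to determine" -- the "alive? null"
    // seen in the HighAvailabilityManagerImpl log lines.
    Optional<Boolean> isVmAlive(String vm);
}

interface Fencer {
    boolean fenceOff(String vm); // true if the VM was successfully fenced
}

class HaWorker {
    static String check(List<Investigator> investigators,
                        List<Fencer> fencers, String vm) {
        // Ask each investigator in turn; the first definite answer wins.
        for (Investigator i : investigators) {
            Optional<Boolean> alive = i.isVmAlive(vm);
            if (alive.isPresent()) {
                return alive.get() ? "alive" : "down";
            }
            // otherwise: "<Investigator> found <vm> to be alive? null"
        }
        // No investigator could decide: try to fence the VM off.
        for (Fencer f : fencers) {
            if (f.fenceOff(vm)) {
                return "fenced";
            }
        }
        // Per the comment: if the fencers also fail, nothing is done.
        return "undetermined";
    }
}
```

In the logs attached to this issue, every investigator returns `null`, so the worker falls through to the fencing stage.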
> Losing the connection from CloudStack Manager to the agent will force a
> shutdown when connection is re-established
> ------------------------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-6857
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6857
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.3.0
> Environment: Ubuntu 12.04
> Reporter: c-hemp
> Priority: Critical
>
> If a physical host is not pingable, that host goes into alert mode. When the
> physical host is unreachable, the virtual router is either unreachable or
> unable to ping a virtual machine on that host, and since the manager is also
> unable to ping the virtual instance, it assumes the instance is down and puts
> it into a stopped state.
> When the connection is re-established, the manager reads the state from the
> database, sees that the instance is now marked stopped, and shuts the
> instance down.
> This behavior can cause major outages after any kind of network loss, once
> connectivity comes back. It is especially critical when running CloudStack
> across multiple colos.
> The logs when it happens:
> 2014-06-06 02:01:22,259 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) PingInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:ctx-be848615 work-1953) Not a System Vm, unable to determine
> state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:ctx-be848615 work-1953) Testing if VM[User|cephvmstage013] is
> alive
> 2014-06-06 02:01:22,260 DEBUG [c.c.h.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:ctx-be848615 work-1953) Unable to find a management nic, cannot
> ping this system VM, unable to determine state of VM[User|cephvmstage013]
> returning null
> 2014-06-06 02:01:22,260 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) ManagementIPSysVMInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) KVMInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) HypervInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) KVMInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) HypervInvestigator found
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,584 WARN [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) Unable to actually stop
> VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,585 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) VM[User|cephvmstage013] is stopped on
> the host. Proceeding to release resource held.
> 2014-06-06 02:01:22,648 WARN [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) Unable to actually stop
> VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,650 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) VM[User|cephvmstage013] is stopped on
> the host. Proceeding to release resource held.
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released network resources
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released storage resources
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) Successfully released network resources
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl]
> (HA-Worker-1:ctx-be848615 work-1953) Successfully released storage resources
> for the vm VM[User|cephvmstage013]
> The behavior should change: the instance should instead be placed into an
> alert state, and once connectivity is re-established, if the instance is
> actually up, the manager should be updated with its running status.
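The suggested behavior above can be sketched as a small state machine. This is an illustration of the reporter's proposal, not CloudStack code; the enum values and method names are invented for the example. The idea is that an undetermined HA check moves the VM to an alert state rather than forcing a stop, and the real state is reconciled only after the agent reconnects.

```java
// Hypothetical sketch of the proposed fix. Names are illustrative only.
enum VmState { RUNNING, STOPPED, ALERT }

class StateReconciler {
    VmState dbState = VmState.RUNNING;

    // Called when every investigator returns "unable to determine":
    // mark the VM Alert instead of force-stopping it in the database.
    void onUndetermined() {
        dbState = VmState.ALERT;
    }

    // Called once the agent connection is re-established: trust the
    // hypervisor's actual state rather than the stale database state.
    void onReconnect(boolean vmActuallyRunning) {
        if (dbState == VmState.ALERT) {
            dbState = vmActuallyRunning ? VmState.RUNNING : VmState.STOPPED;
        }
    }
}
```

Under this scheme, the scenario in this report (network loss, then reconnect with the VM still running) ends with the database showing RUNNING instead of the manager shutting the instance down.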
--
This message was sent by Atlassian JIRA
(v6.2#6252)