Gerard Lynch created CLOUDSTACK-3421:
----------------------------------------

             Summary: When hypervisor is down, no HA occurs with log output 
"Agent state cannot be determined, do nothing"
                 Key: CLOUDSTACK-3421
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: KVM, Management Server
    Affects Versions: 4.1.0
         Environment: CentOS 6.4 minimal install
Libvirt, KVM/Qemu
CloudStack 4.1
GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount Point

3 physical servers
* 1 management server, running NFS secondary storage
** 1 nic, management+storage
* 2 hypervisor nodes, running glusterfs-server 
** 4x nic, management+storage, public, guest, gluster peering
* Advanced zone
* KVM
* 4 networks:
** eth0: cloudbr0: management+secondary storage
** eth2: cloudbr1: public
** eth3: cloudbr2: guest
** eth1: gluster peering
* Shared Mount Point
* custom network offering with redundant routers enabled
* global settings tweaked to increase speed of identifying down state
** ping.interval: 10sec
            Reporter: Gerard Lynch
            Priority: Critical
             Fix For: 4.1.1, 4.2.0, Future


We wanted to test CloudStack's HA capabilities by simulating outages to find 
out how long recovery would take.  One of the tests simulated the loss of a 
hypervisor node by shutting it down.  When we ran this test, we found that 
CloudStack failed to bring up any of the VMs (system or instance) that were on 
the down node until the node was powered back up and reconnected.

In the logs, we see repeated occurrences of:

INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find 
exception: com.cloud.exception.OperationTimedoutException in error code list 
for exceptions
INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not find 
exception: com.cloud.exception.OperationTimedoutException in error code list 
for exceptions
WARN  [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: Timed 
out on Seq 14-660013135:  { Cmd , MgmtId: 93515041483, via: 14, Ver: v1, Flags: 
100011, [{"CheckHealthCommand":{"wait":50}}] }
WARN  [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: Timed 
out on Seq 15-1097531400:  { Cmd , MgmtId: 93515041483, via: 15, Ver: v1, 
Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed out: 
Commands 660013135 to Host 14 timed out after 100
WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed out: 
Commands 1097531400 to Host 15 timed out after 100
WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot 
be determined, do nothing
WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot 
be determined, do nothing


To reproduce: 
1. Build the environment as detailed above
2. Register an ISO
3. Create a new guest network using the custom network offering (that offers 
redundant routers)
4. Provision an instance
5. Ensure the system VMs and instance are on the first hypervisor node (see 
the API sketch after this list for one way to check placement)
6. Shut down the first hypervisor node (or pull the plug)
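
One way to confirm placement for step 5 is to query the management server 
API directly; the "hostname" field in the responses shows which hypervisor 
each VM is on.  Below is a minimal sketch of a signed CloudStack API query; 
the endpoint and keys are placeholders for our setup, not values from this 
report.

#!/usr/bin/env python
# Minimal sketch of a signed CloudStack API query to check VM placement.
# ENDPOINT, API_KEY and SECRET_KEY are placeholders; the signing scheme
# (URL-encoded params, sorted, lowercased, HMAC-SHA1, base64) is the
# standard CloudStack API signature.
import base64
import hashlib
import hmac
import urllib.parse
import urllib.request

ENDPOINT = "http://management-server:8080/client/api"   # placeholder
API_KEY = "YOUR_API_KEY"                                 # placeholder
SECRET_KEY = "YOUR_SECRET_KEY"                           # placeholder

def call(command, **params):
    """Send a signed API request and return the raw JSON response text."""
    params.update({"command": command, "apiKey": API_KEY, "response": "json"})
    query = "&".join(
        "%s=%s" % (k, urllib.parse.quote(str(params[k]), safe=""))
        for k in sorted(params, key=str.lower)
    )
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(),
                      hashlib.sha1).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
    return urllib.request.urlopen(
        "%s?%s&signature=%s" % (ENDPOINT, query, signature)).read().decode()

# The "hostname" field in these responses shows the hypervisor for each VM.
print(call("listSystemVms"))
print(call("listVirtualMachines", listall="true"))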

Expected result:
  All system VMs and instance(s) should be brought up on the 2nd hypervisor 
node.

Actual result:
  We see the first hypervisor node marked "Disconnected".
  All system VMs and the instance are still marked "Running"; however, pings 
to any of them fail. 
  Ping to the redundant router on the 2nd hypervisor node still works.

  We see in the logs 

  "INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not 
find exception: com.cloud.exception.OperationTimedoutException in error code 
list for exceptions"

  Followed by

  "Agent state cannot be determined, do nothing"


Searching for "CloudStack Agent state cannot be determined, do nothing" led 
me to: CLOUDSTACK-803 - https://reviews.apache.org/r/8853/

This caused me some concern, because if I read the logic in that ticket 
correctly, the management server will not perform any HA actions if it's 
unable to determine the state of a hypervisor node.  In the scenario above, 
it's not a loss of connectivity but an actual outage on the hypervisor, so 
I'd rather HA did occur.  Split brain is a concern, but I think something 
along the lines of "if the hypervisor can't see the management server or the 
gateway, stop its instances" is more appropriate than "do nothing".
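
To make the distinction concrete, here is a tiny illustrative sketch (made-up 
names and states, not the actual management-server code) of the current 
behaviour as I read it versus what I'd prefer:

# Illustrative only: states and helpers are invented to contrast the current
# behaviour (as I read CLOUDSTACK-803) with the behaviour I'd prefer.
UP, DOWN, UNKNOWN = "Up", "Down", "Unknown"

def management_server_reaction(investigated_state):
    """Today: HA only runs when investigation returns a definite Down."""
    if investigated_state == DOWN:
        return "restart the host's VMs on another hypervisor"
    if investigated_state == UP:
        return "reconnect the agent"
    # A powered-off node lands here, so its VMs are never restarted.
    return "Agent state cannot be determined, do nothing"

def agent_should_self_fence(sees_management, sees_gateway):
    """Preferred: the hypervisor side fences itself when isolated, so the
    management server could safely treat 'unreachable' as 'down' without
    risking split brain."""
    return not (sees_management or sees_gateway)

print(management_server_reaction(UNKNOWN))   # current outcome in our test
print(agent_should_self_fence(False, False)) # True: stop local instances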

I'm hoping this is something really obvious and simple to resolve, because 
otherwise it's a pretty serious issue: as it stands, any accidental shutdown 
or hardware fault will cause a prolonged outage that requires manual 
intervention to resolve.


Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
