Gerard Lynch created CLOUDSTACK-3421:
----------------------------------------
Summary: When hypervisor is down, no HA occurs with log output
"Agent state cannot be determined, do nothing"
Key: CLOUDSTACK-3421
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Components: KVM, Management Server
Affects Versions: 4.1.0
Environment: CentOS 6.4 minimal install
Libvirt, KVM/Qemu
CloudStack 4.1
GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount Point
3 physical servers
* 1 management server, running NFS secondary storage
** 1 nic, management+storage
* 2 hypervisor nodes, running glusterfs-server
** 4x nic, management+storage, public, guest, gluster peering
* Advanced zone
* KVM
* 4 networks:
** eth0: cloudbr0: management+secondary storage
** eth2: cloudbr1: public
** eth3: cloudbr2: guest
** eth1: gluster peering
* Shared Mount Point
* custom network offering with redundant routers enabled
* global settings tweaked to increase speed of identifying down state
** ping.interval: 10sec
Reporter: Gerard Lynch
Priority: Critical
Fix For: 4.1.1, 4.2.0, Future
We wanted to test CloudStack's HA capabilities by simulating outages to find
out how long recovery would take. One of the tests simulated the loss of a
hypervisor node by shutting it down. When we ran this test, we found that
CloudStack failed to bring up any of the VMs (System or Instance) that were on
the downed node until that node was powered back up and reconnected.
In the logs, we see repeated occurrences of:
INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find
exception: com.cloud.exception.OperationTimedoutException in error code list
for exceptions
INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not find
exception: com.cloud.exception.OperationTimedoutException in error code list
for exceptions
WARN [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: Timed
out on Seq 14-660013135: { Cmd , MgmtId: 93515041483, via: 14, Ver: v1, Flags:
100011, [{"CheckHealthCommand":{"wait":50}}] }
WARN [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: Timed
out on Seq 15-1097531400: { Cmd , MgmtId: 93515041483, via: 15, Ver: v1,
Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed out:
Commands 660013135 to Host 14 timed out after 100
WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed out:
Commands 1097531400 to Host 15 timed out after 100
WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot
be determined, do nothing
WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot
be determined, do nothing
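To make the behaviour in the log easier to follow, here is a rough sketch of the
decision the output suggests. This is not the actual AgentManagerImpl source;
the class, enum, and method names below are assumptions made for illustration:
a timed-out health check leaves the host status undetermined, and the
undetermined branch takes no HA action.

// Hypothetical illustration only -- not the real AgentManagerImpl code.
public class AgentStateSketch {

    enum HostStatus { UP, DOWN, UNKNOWN }

    // Assumed stand-in for the CheckHealthCommand round trip that timed out.
    HostStatus investigate(long hostId) {
        try {
            sendCheckHealth(hostId);          // times out when the node is powered off
            return HostStatus.UP;
        } catch (OperationTimedOut e) {
            return HostStatus.UNKNOWN;        // cannot tell "down" from "unreachable"
        }
    }

    void onPingTimeout(long hostId) {
        switch (investigate(hostId)) {
            case DOWN:
                scheduleHaForVmsOn(hostId);   // what we expected to happen
                break;
            case UNKNOWN:
                // Matches the observed log line:
                // "Agent state cannot be determined, do nothing"
                break;
            case UP:
                break;
        }
    }

    // Placeholders so the sketch compiles on its own.
    static class OperationTimedOut extends Exception { }
    void sendCheckHealth(long hostId) throws OperationTimedOut { }
    void scheduleHaForVmsOn(long hostId) { }
}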
To reproduce:
1. Build the environment as detailed above
2. Register an ISO
3. Create a new guest network using the custom network offering (that offers
redundant routers)
4. Provision an instance
5. Ensure the system VMs and the instance are on the first hypervisor node
6. Shut down the first hypervisor node (or pull the plug)
Expected result:
All system VMs and instance(s) should be brought up on the 2nd hypervisor
node.
Actual result:
We see the first hypervisor node marked "disconnected."
All System VMs and the Instance are still marked "Running"; however, pings to
any of them fail.
Ping to the redundant router on the 2nd hypervisor node is still working.
We see in the logs:
"INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not
find exception: com.cloud.exception.OperationTimedoutException in error code
list for exceptions"
Followed by
"Agent state cannot be determined, do nothing"
Searching for "CloudStack Agent state cannot be determined, do nothing" led
to: CLOUDSTACK-803 - https://reviews.apache.org/r/8853/
This caused me some concern, because if I read the logic in that ticket
correctly, the management server will not perform any HA actions if it is
unable to determine the state of a hypervisor node. In the scenario above,
it's not a loss of connectivity but an actual outage on the hypervisor, so
I'd rather like HA to occur. Split brain is a concern, but I think something
along the lines of "if the hypervisor can't see the management server or its
gateway, stop its instances" is more relevant than "do nothing".
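To make that suggestion concrete, below is a minimal, hypothetical sketch of
agent-side fencing along those lines. None of these class or method names exist
in CloudStack; it is only an illustration under the assumption that a
hypervisor-side agent runs such a check periodically: if the host can reach
neither the management server nor its gateway, it stops its local guests so the
management server can safely restart them elsewhere.

import java.io.IOException;
import java.net.InetAddress;

// Hypothetical self-fencing check a hypervisor-side agent could run periodically.
// Illustrative only; not part of CloudStack.
public class SelfFenceSketch {

    private static final int TIMEOUT_MS = 2000;

    static boolean reachable(String host) {
        try {
            return InetAddress.getByName(host).isReachable(TIMEOUT_MS);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String managementServer = args[0];  // management server address (assumed input)
        String gateway = args[1];           // default gateway address (assumed input)

        boolean isolated = !reachable(managementServer) && !reachable(gateway);
        if (isolated) {
            // The host is cut off from both management and its gateway:
            // stop local guests so they can safely be started on another
            // host without risking split brain.
            stopAllLocalInstances();
        }
    }

    // Placeholder for the libvirt/virsh calls that would stop local guests.
    static void stopAllLocalInstances() { }
}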
I'm hoping this is something really obvious and simple to resolve, because
otherwise it is a pretty serious issue: as it stands, any accidental shutdown
or hardware fault will cause a prolonged outage that requires manual
intervention to resolve.
Thanks