[
https://issues.apache.org/jira/browse/CLOUDSTACK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gerard Lynch updated CLOUDSTACK-3421:
-------------------------------------
Attachment: catalina_management-server.zip
Attached are our management server's catalina.out and management-server.log files.
Let me know if you require anything further.
> When hypervisor is down, no HA occurs with log output "Agent state cannot be
> determined, do nothing"
> ----------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-3421
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: KVM, Management Server
> Affects Versions: 4.1.0
> Environment: CentOS 6.4 minimal install
> Libvirt, KVM/Qemu
> CloudStack 4.1
> GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount
> Point
> 3 physical servers
> * 1 management server, running NFS secondary storage
> ** 1 nic, management+storage
> * 2 hypervisor nodes, running glusterfs-server
> ** 4x nic, management+storage, public, guest, gluster peering
> * Advanced zone
> * KVM
> * 4 networks:
> eth0: cloudbr0: management+secondary storage
> eth2: cloudbr1: public
> eth3: cloudbr2: guest
> eth1: gluster peering
> * Shared Mount Point
> * custom network offering with redundant routers enabled
> * global settings tweaked to increase speed of identifying down state
> ** ping.interval: 10sec
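> ** (for reference: with the default ping.timeout multiplier of 2.5, a 10s
> ping.interval should flag an overdue agent after roughly 2.5 x 10 = 25s -
> assuming 4.1's defaults here)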
> Reporter: Gerard Lynch
> Priority: Critical
> Fix For: 4.1.1, 4.2.0, Future
>
> Attachments: catalina_management-server.zip
>
>
> We wanted to test CloudStack's HA capabilities by simulating outages to find
> out how long recovery would take. One of the tests simulated the loss of a
> hypervisor node by shutting it down. When we ran this test, we found that
> CloudStack failed to bring up any of the VMs (system or instance) that were
> on the down node until the node was powered back up and reconnected.
> In the logs, we see repeated occurrences of:
> INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not
> find exception: com.cloud.exception.OperationTimedoutException in error code
> list for exceptions
> INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not
> find exception: com.cloud.exception.OperationTimedoutException in error code
> list for exceptions
> WARN [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135:
> Timed out on Seq 14-660013135: { Cmd , MgmtId: 93515041483, via: 14, Ver:
> v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400:
> Timed out on Seq 15-1097531400: { Cmd , MgmtId: 93515041483, via: 15, Ver:
> v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed
> out: Commands 660013135 to Host 14 timed out after 100
> WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed
> out: Commands 1097531400 to Host 15 timed out after 100
> WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot
> be determined, do nothing
> WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot
> be determined, do nothing
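> For context, the "do nothing" message appears to come from the management
> server's disconnect handling. Below is a simplified sketch of that flow as I
> read the 4.1 sources (a paraphrase for discussion, not verbatim code; method
> and field names are approximate):
>
>     // Sketch of the disconnect investigation in AgentManagerImpl
>     // (paraphrased; names approximate).
>     protected Status investigate(AgentAttache agent) {
>         Long hostId = agent.getId();
>         // Sends the CheckHealthCommand seen in the log excerpt above; on
>         // a dead host this times out and the send returns null.
>         Answer answer = easySend(hostId, new CheckHealthCommand());
>         if (answer != null && answer.getResult()) {
>             return Status.Up;
>         }
>         // Fall back to the HA investigators, which may also be unable to
>         // reach a verdict and so return null.
>         return _haMgr.investigate(hostId);
>     }
>
>     // In the disconnect task (the AgentTaskPool threads in the log):
>     Status determinedState = investigate(attache);
>     if (determinedState == null) {
>         // The branch we are hitting: no HA is scheduled for the host.
>         s_logger.warn("Agent state cannot be determined, do nothing");
>         return false;
>     }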
> To reproduce:
> 1. Build the environment as detailed above
> 2. Register an ISO
> 3. Create a new guest network using the custom network offering (that offers
> redundant routers)
> 4. Provision an instance
> 5. Ensure the system VMs and the instance are on the first hypervisor node
> 6. Shut down the first hypervisor node (or pull the plug)
> Expected result:
> All system VMs and instance(s) should be brought up on the 2nd hypervisor
> node.
> Actual result:
> We see the first hypervisor node marked "Disconnected".
> All system VMs and the instance are still marked "Running"; however, pings
> to any of them fail.
> Pings to the redundant router on the 2nd hypervisor node still work.
> We see in the logs
> "INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not
> find exception: com.cloud.exception.OperationTimedoutException in error code
> list for exceptions"
> Followed by
> "Agent state cannot be determined, do nothing"
> Searching for "Cloudstack Agent state cannot be determined, do nothing" led
> me to CLOUDSTACK-803 (https://reviews.apache.org/r/8853/), which caused me
> some concern, because if I read the logic in that ticket correctly, the
> management server will not perform any HA actions if it is unable to
> determine the state of a hypervisor node. In the scenario above it's not a
> loss of connectivity but an actual outage on the hypervisor, so I'd want HA
> to occur. Split brain is a concern, but I think something along the lines of
> "if the hypervisor can't see the management server or the gateway, stop the
> instances" is more appropriate than "do nothing".
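> To make the suggestion concrete, here is a purely hypothetical sketch of an
> agent-side self-fencing check (all method and variable names are invented
> for illustration; this is not an existing CloudStack API):
>
>     // Hypothetical self-fencing check the KVM agent could run when it
>     // loses its management-server connection. Uses only the JDK; the
>     // stopAllLocalInstances() helper is invented (e.g. it could issue a
>     // "virsh destroy" per running domain).
>     boolean shouldSelfFence(String mgmtServerIp, String gatewayIp)
>             throws IOException {
>         boolean mgmtUp = InetAddress.getByName(mgmtServerIp).isReachable(5000);
>         boolean gwUp = InetAddress.getByName(gatewayIp).isReachable(5000);
>         // If we can see neither the management server nor the gateway,
>         // assume we are the isolated side and stop our instances so the
>         // management server can safely restart them elsewhere.
>         return !mgmtUp && !gwUp;
>     }
>
>     if (shouldSelfFence(mgmtServerIp, gatewayIp)) {
>         stopAllLocalInstances();
>     }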
> I'm hoping this is something really obvious and simple to resolve, because
> otherwise this is a pretty serious issue: as it stands, any accidental
> shutdown or hardware fault causes a continuous outage that requires manual
> intervention to resolve.
> Thanks