A robust solution would probably involve Apache ZooKeeper (perhaps via Curator) to perform distributed locking and/or leader election.
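To make the locking idea a bit more concrete, here is a rough sketch. It is not CloudStack code and it does not use the real ZooKeeper or Curator APIs: VmLeaseTable, acquireOrRenew and ownerOf are names I made up, and an in-memory map stands in for the coordination service. The idea is a per-VM lease that expires unless the owning host keeps renewing it, which is roughly the role an ephemeral node plays in ZooKeeper. Time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical in-memory stand-in for a ZooKeeper-style lock service.
 * Each VM has at most one live lease, held by a single host.
 */
class VmLeaseTable {
    /** One lease: which host owns the VM and when that claim expires. */
    private static final class Lease {
        final String hostId;
        final long expiresAtMillis;

        Lease(String hostId, long expiresAtMillis) {
            this.hostId = hostId;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();
    private final long ttlMillis;

    VmLeaseTable(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Acquire or renew the lease on vmId for hostId at time nowMillis.
     * Succeeds if the VM is unowned, the previous lease has expired,
     * or the caller is already the owner (a renewal).
     */
    synchronized boolean acquireOrRenew(String vmId, String hostId, long nowMillis) {
        Lease current = leases.get(vmId);
        boolean free = current == null || current.expiresAtMillis <= nowMillis;
        if (free || current.hostId.equals(hostId)) {
            leases.put(vmId, new Lease(hostId, nowMillis + ttlMillis));
            return true;
        }
        // Another host holds a live lease: the caller must not start the VM.
        return false;
    }

    /** Returns the host holding a live lease on vmId, or null if there is none. */
    synchronized String ownerOf(String vmId, long nowMillis) {
        Lease current = leases.get(vmId);
        boolean free = current == null || current.expiresAtMillis <= nowMillis;
        return free ? null : current.hostId;
    }
}
```

Note that a lease only helps if a host that cannot renew stops its VMs (or is fenced) before the lease runs out. Otherwise a second host can still start the same VM while the first is running it, which is the double-run case raised in the quoted thread.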
On 7/15/13 3:51 PM, "Chiradeep Vittal" <chiradeep.vit...@citrix.com> wrote:
>Indeed HA is very tricky as you note. In the generic case where the MS
>cannot communicate with the agent, nothing can be concluded and the MS
>does nothing.
>I dug this up and posted it to the wiki
>https://cwiki.apache.org/confluence/x/dwn8AQ
>
>
>On 7/15/13 1:20 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:
>
>>I don't know much about HA in regards to management server/agent
>>connectivity, but it seems to me like this is perilous ground. If a
>>host loses connection with the management server, it seems to me that
>>the management server doesn't have the resources to determine whether
>>it should start HA-enabled VMs elsewhere. You could very well end up
>>with VMs running in two or three places at once, corrupting them, just
>>because a host failed to check in. Maybe the agent was stopped (that
>>happens all the time). The management server has no fencing
>>capability, hence the messages "I don't know, doing nothing", are the
>>correct thing to do. That doesn't seem like it's KVM specific,
>>however.
>>
>>I'm very interested in hearing the details on how this HA was intended
>>to work, or how it might be working on other platforms. One solution
>>may be to leverage the secondary storage to create locks for VMs, then
>>again, when VMs can run without the agent it seems prone to deadlock
>>(how does another node take over when another host has the lock, but
>>the host seems down, but is actually running the vm?).
>>
>>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <paul.an...@shapeblue.com>
>>wrote:
>>> I bumped this from the user list as we've just come across the same
>>>issue.
>>>
>>> CloudStack does not react or even change host status when contact is
>>>lost with a KVM host.
>>>
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>>(AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
>>>null ('I don't know')
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>>(AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>>host, returning that we don't have enough information
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>>Moving on.
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>>Moving on.
>>> 2013-07-13 17:53:56,695 WARN [agent.manager.AgentManagerImpl]
>>>(AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>>
>>> HA for KVM is almost useless.
>>>
>>> I suggest this a blocker for any release until fixed.
>>>
>>>
>>> Regards,
>>>
>>> Paul Angus
>>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>>> paul.an...@shapeblue.com
>>>
>>> -----Original Message-----
>>> From: Koushik Das [mailto:koushik....@citrix.com]
>>> Sent: 12 July 2013 12:21
>>> To: us...@cloudstack.apache.org
>>> Subject: RE: cs 4.1 host disconnected status
>>>
>>> I looked at the logs and none of the existing investigators are able to
>>>determine that the host is down. I am not sure if there is a clean way
>>>to identify if a host is down in case of KVM. Consider the following
>>>cases:
>>>
>>> 1. Host is actually shutdown
>>> 2. Management nic of the host is plugged out of the network but host is
>>>up and running
>>>
>>> There is no clean way to distinguish these cases. Cloudstack should
>>>only mark the host as down in the first case. But not sure how one would
>>>achieve this.
>>>
>>> -Koushik
>>>
>>>> -----Original Message-----
>>>> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>>> Sent: Friday, July 12, 2013 2:39 PM
>>>> To: us...@cloudstack.apache.org
>>>> Subject: Re: cs 4.1 host disconnected status
>>>>
>>>> I've simulated crash again and here is the log:
>>>> http://thesuki.org/temp/cs.log.txt
>>>> I stripped out of there GET requests with api keys.
>>>> Server was switched off at 8:36
>>>>
>>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>>><koushik....@citrix.com>wrote:
>>>>
>>>> > Looks like the KVM investigator is not able to determine the state
>>>> > of the agent. Can you share the full log?
>>>> >
>>>> > > -----Original Message-----
>>>> > > From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>>> > > To: users
>>>> > > Subject: cs 4.1 host disconnected status
>>>> > >
>>>> > > Hi all.
>>>> > >
>>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>>> > > secondary storage.
>>>> > > and I have the following problem:
>>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>>> > > crash), it
>>>> > never
>>>> > > goes to Disconnected status in management. Accordingly, ha-enabled
>>>> > > VMs are not restarted on another hypervisor node, because it
>>>> > > believes that disconnected node is still online.
>>>> > >
>>>> > >
>>>> > > I get following in management server logs:
>>>> > >
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>>> > > (AgentManager-Handler-13:null) Seq 19-1133189098:
>>>>Processing:
>>>> > > { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
>>>> > > [{"Answer":{"result":false,"details": "Unable to ping
>>>>computing host,
>>>> > > exiting","wait":0}}] }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>>> > > (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: ,
>>>>MgmtId:
>>>> > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>>> > > (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged,
>>>> > > returning
>>>> > null
>>>> > > ('I don't know')
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>>> > > (AgentTaskPool-1:null) could not reach agent, could not reach
>>>>agent's
>>>> > > host, returning that we don't have enough information
>>>> > > 2013-07-11 10:19:16,153 DEBUG
>>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>>> > > (AgentTaskPool-1:null) null unable to determine the state of the
>>>>host.
>>>> > > Moving on.
>>>> > > 2013-07-11 10:19:16,153 DEBUG
>>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>>> > > (AgentTaskPool-1:null) null unable to determine the state of the
>>>>host.
>>>> > > Moving on.
>>>> > > 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl]
>>>> > > (AgentTaskPool-1:null) Agent state cannot be determined,
>>>>do
>>>> > > nothing
>>>> > >
>>>> > >
>>>> > > If I power on dead node, it goes to state "Connecting" and then
>>>>"Up"
>>>> > > in management interface.
>>>> > >
>>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>>> > > Enabled, Agent event = AgentConnected, Host id = 12, name =
>>>> > > ad112.colobridge.net]
>>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>>> > > = ad112.colobridge.net; old status = Up; event = AgentConnected;
>>>> > > new
>>>> > status
>>>> > > = Connecting; old update count = 1285; new update count = 1286]
>>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>>> > > Enabled, Agent event = Ready, Host id = 12, name =
>>>> > > ad112.colobridge.net]
>>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>>> > > = ad112.colobridge.net; old status = Connecting; event = Ready;
>>>> > > new
>>>> > status =
>>>> > > Up; old update count = 1286; new update count = 1287]
>>>> > >
>>>> > >
>>>> > > If I restart cloud-management service, dead node goes to state
>>>> > > "Disconnected" in management interface.
>>>> > > (there is nothing special in logs in this case)
>>>> > >
>>>> > > If I do nothing, dead node could stay in "Up" state forever (I
>>>> > > waited
>>>> > for
>>>> > > 12 hours) in management interface, throwing into logs "Agent state
>>>> > > cannot be determined, do nothing"
>>>> > >
>>>> > > Would appreciate if someone could help/suggest how to deal with
>>>> > > this problem.
>>>> > >
>>>> > > --
>>>> > > Regards,
>>>> > > Valery
>>>> > >
>>>> > > http://protocol.by/slayer
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Valery
>>>>
>>>> http://protocol.by/slayer
>>> This email and any attachments to it may be confidential and are
>>>intended solely for the use of the individual to whom it is addressed.
>>>Any views or opinions expressed are solely those of the author and do
>>>not necessarily represent those of Shape Blue Ltd or related companies.
>>>If you are not the intended recipient of this email, you must neither
>>>take any action based upon its contents, nor copy or show it to anyone.
>>>Please contact the sender if you believe you have received this email in
>>>error. Shape Blue Ltd is a company incorporated in England & Wales.
>>>ShapeBlue Services India LLP is operated under license from Shape Blue
>>>Ltd. ShapeBlue is a registered trademark.
>