Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Chiradeep Vittal Mon, 15 Jul 2013 04:03:49 -0700

A robust solution would probably involve Apache ZooKeeper (perhaps via
Curator) to perform distributed locking and/or leader election.
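
To make the locking idea concrete, here is a minimal sketch of a
lease-style lock with a fencing token. It is an in-memory stand-in, not
the ZooKeeper or Curator API: LeaseTable, acquire, is_current and the
30-second TTL are hypothetical names and values picked for illustration.
The assumption is that a host must keep renewing its lease to keep
running a VM, and that whatever protects the disk refuses a stale token.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    owner: str         # host currently allowed to run the VM
    expires_at: float  # after this time the lease counts as abandoned
    token: int         # fencing token, grows each time a new lease is issued


class LeaseTable:
    """In-memory stand-in for a coordination service (hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._leases: Dict[str, Lease] = {}
        self._next_token = 1

    def acquire(self, vm: str, host: str, now: float) -> Optional[int]:
        """Return a fencing token if `host` holds the lease for `vm`, else None."""
        lease = self._leases.get(vm)
        if lease is not None and now < lease.expires_at:
            if lease.owner != host:
                return None                    # another host holds a live lease
            lease.expires_at = now + self.ttl  # renewal by the current owner
            return lease.token
        # No lease yet, or the old one expired: issue a fresh one.
        token = self._next_token
        self._next_token += 1
        self._leases[vm] = Lease(host, now + self.ttl, token)
        return token

    def is_current(self, vm: str, token: int) -> bool:
        """True only for the most recently issued token for `vm`."""
        lease = self._leases.get(vm)
        return lease is not None and lease.token == token


table = LeaseTable(ttl=30.0)
t1 = table.acquire("vm-1", "host-a", now=0.0)        # host-a starts the VM
blocked = table.acquire("vm-1", "host-b", now=10.0)  # lease still live
t2 = table.acquire("vm-1", "host-b", now=45.0)       # host-a stopped renewing
```

This only helps if a host that cannot renew its lease stops its own VMs
(or is fenced), which is the part the management server cannot do today.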


On 7/15/13 3:51 PM, "Chiradeep Vittal" <[email protected]> wrote:

>Indeed HA is very tricky as you note. In the generic case where the MS
>cannot communicate with the agent, nothing can be concluded and the MS
>does nothing.
>I dug this up and posted it to the wiki
>https://cwiki.apache.org/confluence/x/dwn8AQ
>
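
For readers following the logs further down, the "I don't know"
behaviour can be sketched roughly as below. This is not the CloudStack
code: investigate, should_restart_vms and the two lambda investigators
are hypothetical names, and the first-definite-answer rule is my reading
of the log messages rather than a confirmed description of the
implementation.

```python
from typing import Callable, Iterable, Optional

# Each investigator answers True (host is up), False (host is down),
# or None ("I don't know"), mirroring the log messages in this thread.
Investigator = Callable[[str], Optional[bool]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[bool]:
    """Return the first definite answer, or None if nobody can tell."""
    for check in investigators:
        verdict = check(host)
        if verdict is not None:
            return verdict
    return None


def should_restart_vms(host: str, investigators: Iterable[Investigator]) -> bool:
    """Restart VMs only on a definite 'down'; unknown means do nothing."""
    return investigate(host, investigators) is False


unreachable = lambda host: None   # cannot ping, cannot reach the agent
powered_off = lambda host: False  # e.g. an out-of-band check says the host is off

no_action = should_restart_vms("10.0.100.51", [unreachable, unreachable])
restart = should_restart_vms("10.0.100.51", [unreachable, powered_off])
```

With only "unreachable" style investigators the answer stays unknown, so
nothing is restarted; some source of a definite "down" is needed before
HA can act safely.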
>
>On 7/15/13 1:20 PM, "Marcus Sorensen" <[email protected]> wrote:
>
>>I don't know much about HA in regards to management server/agent
>>connectivity, but it seems to me like this is perilous ground.  If a
>>host loses connection with the management server, it seems to me that
>>the management server doesn't have the resources to determine whether
>>it should start HA-enabled VMs elsewhere. You could very well end up
>>with VMs running in two or three places at once, corrupting them, just
>>because a host failed to check in. Maybe the agent was stopped (that
>>happens all the time). The management server has no fencing
>>capability, hence the messages "I don't know, doing nothing", are the
>>correct thing to do. That doesn't seem like it's KVM specific,
>>however.
>>
>>I'm very interested in hearing the details on how this HA was intended
>>to work, or how it might be working on other platforms.  One solution
>>may be to leverage the secondary storage to create locks for VMs. Then
>>again, when VMs can run without the agent it seems prone to deadlock
>>(how does another node take over when a host holds the lock and
>>seems down, but is actually running the VM?).
>>
>>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <[email protected]>
>>wrote:
>>> I bumped this from the user list as we've just come across the same
>>>issue.
>>>
>>> CloudStack does not react or even change host status when contact is
>>>lost with a KVM host.
>>>
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>>(AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
>>>null ('I don't know')
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>>(AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>>host, returning that we don't have enough information
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>>Moving on.
>>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>>Moving on.
>>> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl]
>>>(AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>>
>>> HA for KVM is almost useless.
>>>
>>> I suggest this is a blocker for any release until fixed.
>>>
>>>
>>> Regards,
>>>
>>> Paul Angus
>>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>>> [email protected]
>>>
>>> -----Original Message-----
>>> From: Koushik Das [mailto:[email protected]]
>>> Sent: 12 July 2013 12:21
>>> To: [email protected]
>>> Subject: RE: cs 4.1 host disconnected status
>>>
>>> I looked at the logs and none of the existing investigators are able to
>>>determine that the host is down. I am not sure if there is a clean way
>>>to identify if a host is down in case of KVM. Consider the following
>>>cases:
>>>
>>> 1. Host is actually shut down
>>> 2. Management NIC of the host is unplugged from the network but host is
>>>up and running
>>>
>>> There is no clean way to distinguish these cases. Cloudstack should
>>>only mark the host as down in the first case. But not sure how one would
>>>achieve this.
>>>
>>> -Koushik
>>>
>>>> -----Original Message-----
>>>> From: Valery Ciareszka [mailto:[email protected]]
>>>> Sent: Friday, July 12, 2013 2:39 PM
>>>> To: [email protected]
>>>> Subject: Re: cs 4.1 host disconnected status
>>>>
>>>> I've simulated crash again and here is the log:
>>>> http://thesuki.org/temp/cs.log.txt
>>>> I stripped out of there GET requests with api keys.
>>>> Server was switched off at 8:36
>>>>
>>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>>><[email protected]> wrote:
>>>>
>>>> > Looks like the KVM investigator is not able to determine the state
>>>> > of the agent. Can you share the full log?
>>>> >
>>>> > > -----Original Message-----
>>>> > > From: Valery Ciareszka [mailto:[email protected]]
>>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>>> > > To: users
>>>> > > Subject: cs 4.1 host disconnected status
>>>> > >
>>>> > > Hi all.
>>>> > >
>>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>>> > > secondary storage.
>>>> > > and I have the following problem:
>>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>>> > > crash), it
>>>> > never
>>>> > > goes to Disconnected status in management. Accordingly, ha-enabled
>>>> > > VMs are not restarted on another hypervisor node, because it
>>>> > > believes that disconnected node is still online.
>>>> > >
>>>> > >
>>>> > > I get following in management server logs:
>>>> > >
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>>> > > (AgentManager-Handler-13:null) Seq 19-1133189098:
>>>>Processing:
>>>> > >  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
>>>> > > [{"Answer":{"result":false,"details":     "Unable to ping
>>>>computing host,
>>>> > > exiting","wait":0}}] }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>>> > > (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: ,
>>>>MgmtId:
>>>> > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>>> > > (AgentTaskPool-1:null) host (172.16.20.241) cannot  be pinged,
>>>> > > returning
>>>> > null
>>>> > > ('I don't know')
>>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>>> > > (AgentTaskPool-1:null) could not reach agent, could   not reach
>>>>agent's
>>>> > > host, returning that we don't have enough information
>>>> > > 2013-07-11 10:19:16,153 DEBUG
>>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>>> > > (AgentTaskPool-1:null) null unable to determine  the state of the
>>>>host.
>>>> > >  Moving on.
>>>> > > 2013-07-11 10:19:16,153 DEBUG
>>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>>> > > (AgentTaskPool-1:null) null unable to determine  the state of the
>>>>host.
>>>> > >  Moving on.
>>>> > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl]
>>>> > > (AgentTaskPool-1:null) Agent state cannot be           determined,
>>>>do
>>>> > > nothing
>>>> > >
>>>> > >
>>>> > > If I power on dead node, it goes to state "Connecting" and then
>>>>"Up"
>>>> > > in management interface.
>>>> > >
>>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
>>>> > > Ping timeout for host 12, do invstigation
>>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>>> > > Enabled, Agent event = AgentConnected, Host id = 12, name =
>>>> > > ad112.colobridge.net]
>>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>>> > > = ad112.colobridge.net; old status = Up; event = AgentConnected;
>>>> > > new
>>>> > status
>>>> > > = Connecting; old update count = 1285; new update count = 1286]
>>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>>> > > Enabled, Agent event = Ready, Host id = 12, name =
>>>> > > ad112.colobridge.net]
>>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
>>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>>> > > = ad112.colobridge.net; old status = Connecting; event = Ready;
>>>> > > new
>>>> > status =
>>>> > > Up; old update count = 1286; new update count = 1287]
>>>> > >
>>>> > >
>>>> > > If I restart cloud-management service, dead node goes to state
>>>> > > "Disconnected" in management interface.
>>>> > > (there is nothing special in logs in this case)
>>>> > >
>>>> > > If I do nothing,  dead node could stay in "Up" state forever (I
>>>> > > waited
>>>> > for
>>>> > > 12 hours) in management interface, throwing into logs "Agent state
>>>> > > cannot be determined, do nothing"
>>>> > >
>>>> > > Would appreciate if someone could help/suggest how to deal with
>>>> > > this problem.
>>>> > >
>>>> > > --
>>>> > > Regards,
>>>> > > Valery
>>>> > >
>>>> > > http://protocol.by/slayer
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Valery
>>>>
>>>> http://protocol.by/slayer
>
