[
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128217#comment-14128217
]
Brenn Oosterbaan commented on CLOUDSTACK-7184:
----------------------------------------------
Implementing the solution above looks like the fastest/easiest fix to me.
Ideally you would also want a second heartbeat script on the hypervisor: one
which checks whether CloudStack has been able to communicate with the
hypervisor and, if not, fences the hypervisor using the same interval and
retry counts as CloudStack.
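As a rough illustration, such a watchdog could look like the sketch below. The
marker file and the agent hook that would touch it are hypothetical: nothing in
CloudStack records management-server contact on the host today.
{code:python}
#!/usr/bin/env python
# Sketch of the proposed hypervisor-side watchdog (not working code).
import os
import time

LAST_CONTACT = "/var/run/cloudstack-mgmt-contact"  # hypothetical marker file,
                                                   # touched by the agent on
                                                   # every management-server ping
INTERVAL = 5      # proposed hypervisor.heartbeat.interval, in seconds
MAX_RETRY = 6     # proposed hypervisor.heartbeat.max_retry

def fence():
    # Self-fence the host, in the spirit of the existing XenServer
    # storage heartbeat script (immediate reboot via sysrq).
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("b")

while True:
    try:
        silence = time.time() - os.path.getmtime(LAST_CONTACT)
    except OSError:
        silence = 0  # no marker yet: assume we were just contacted
    if silence > INTERVAL * MAX_RETRY:  # 30 seconds with these values
        fence()
    time.sleep(INTERVAL)
{code}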
This would allow us to set different timeouts for different scenarios without
the risk of corrupted VMs:
- If the storage is unreachable but the hypervisor is up and running: wait
until the storage timeout value is reached, then reboot.
- If the hypervisor has crashed OR the networking stack for that hypervisor is
stuck/unreachable: wait until the hypervisor max retry value is met, HA the
VMs and reboot the hypervisor.
That way we could decide to set hypervisor.heartbeat.interval to 5,
hypervisor.heartbeat.max_retry to 6 and the storage timeout to 180, which
would effectively say: if storage is down, wait 180 seconds; if anything else
happened, only wait 30 seconds.
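To make the arithmetic behind those numbers explicit (a sketch only; the
parameter names are the proposed ones, they do not exist in CloudStack today):
{code:python}
# Proposed example values from the comment above (hypothetical settings):
hypervisor_heartbeat_interval = 5    # seconds between contact checks
hypervisor_heartbeat_max_retry = 6   # missed checks tolerated before fencing
storage_timeout = 180                # seconds before the storage heartbeat fences

# Fencing delay if the hypervisor crashed or its network is unreachable:
print(hypervisor_heartbeat_interval * hypervisor_heartbeat_max_retry)  # 30
# Fencing delay if only the storage is unreachable:
print(storage_timeout)  # 180
{code}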
But this should probably be a feature request.
regards,
Brenn
> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA
> on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-7184
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: Hypervisor Controller, Management Server, XenServer
> Affects Versions: 4.3.0, 4.4.0, 4.5.0
> Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
> Reporter: Remi Bergsma
> Assignee: Daan Hoogland
> Priority: Blocker
>
> A hypervisor got isolated for 30 seconds due to a network issue. CloudStack
> discovered this, marked the host as down, and immediately started HA. Just
> 18 seconds later the hypervisor returned, and we ended up with 5 VMs running
> on two hypervisors at the same time.
> This, of course, resulted in file system corruption and the loss of the VMs.
> One side of the story is why XenServer allowed this to happen (we will not
> bother you with that one here). The CloudStack side of the story: HA should
> only start after at least xen.heartbeat.interval seconds. If the host is
> down long enough, the Xen heartbeat script will fence the hypervisor and
> prevent corruption. If it is not down long enough, nothing should happen
> (see the sketch after the logs below).
> Logs (short):
> 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25 05:03:31,920
> cs marks host up: 2014-07-25 05:03:49,655
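The ask in the description boils down to a guard in front of the HA path. A
minimal sketch of that guard, assuming a should_start_ha() helper in the
management server's HA code (hypothetical name; per the logs the real logic
sits around AgentManagerImpl), is below. This is not actual CloudStack code.
{code:python}
import time

XEN_HEARTBEAT_INTERVAL = 60  # seconds; example value for xen.heartbeat.interval

def should_start_ha(marked_down_at, host_is_up, now=None):
    """Return True only if the host has been down for at least the
    heartbeat interval and is still not back, i.e. the XenServer
    heartbeat script has had the chance to self-fence the host."""
    now = now if now is not None else time.time()
    if now - marked_down_at < XEN_HEARTBEAT_INTERVAL:
        return False         # too early: the host may return, as it did here
    return not host_is_up()  # re-check the host after the window has passed

# With the timestamps from the logs above, HA at 05:03:31 would have been
# refused, and by 05:03:49 the host was back, so no HA (and no corruption).
{code}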