[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710029#comment-13710029
 ] 

Logan B commented on CLOUDSTACK-3535:
-------------------------------------

Please note that this bug does not only affect KVM.  We have experienced the 
same issue with XCP 1.6/XenServer hosts.

The problem stems from a previous fix to prevent a potential split brain issue 
when the management server loses connectivity to the cluster.  The AgentImpl 
function used to mark the host as down when it couldn't be reached; now it just 
marks it as "unable to determine state" and does nothing.  This does fix the 
split brain issue, but if the host actually goes down then HA will never take 
over.

I realize this is a tricky fix, and my programming knowledge is minimal, but I 
do have a suggestion for a fix.  The only time the management server should run 
into an actual split brain issue is if it loses connectivity to the clusters.  
Could the following logic be implemented?

(I apologize for the potentially confusing formatting.)

If: Management server cannot ping host:
-> Then: Try to ping management gateway.
--> If: Management server CAN ping gateway:
---> Then: Try to ping other hosts in cluster:
----> If: Other hosts can be pinged AND gateway can be pinged:
-----> Then: Start HA and send host down report/alert.
----> Else If: Other hosts CANNOT be pinged AND gateway CAN be pinged:
-----> Then: Send cluster connectivity alert, and do nothing with HA.
--> Else If: Management server CANNOT ping gateway:
---> Then: Attempt to send management connectivity alert, and do nothing with 
HA.
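
For anyone who finds the arrows hard to follow, the same decision tree can be 
sketched in Python.  This is only an illustration of the proposal, not 
CloudStack code: the names Action, decide_action and can_ping are made up for 
the example, and "other hosts can be pinged" is assumed to mean that at least 
one other host in the cluster answers.

```python
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    """Possible outcomes of the proposed check (hypothetical names)."""
    NONE = "host reachable, nothing to do"
    START_HA = "start HA and send host down report/alert"
    CLUSTER_ALERT = "send cluster connectivity alert, do nothing with HA"
    MANAGEMENT_ALERT = "attempt management connectivity alert, do nothing with HA"


def decide_action(
    host: str,
    gateway: str,
    peers: Iterable[str],
    can_ping: Callable[[str], bool],
) -> Action:
    """Walk the decision tree from the management server's point of view.

    can_ping is injected so the logic can be exercised without a network;
    it stands in for whatever reachability check the server really uses.
    """
    if can_ping(host):
        return Action.NONE
    # Host is unreachable: first check our own connectivity via the gateway.
    if not can_ping(gateway):
        return Action.MANAGEMENT_ALERT
    # Gateway answers, so look at the other hosts in the same cluster.
    # Assumption: one reachable peer is enough to rule out a cluster outage.
    if any(can_ping(peer) for peer in peers):
        return Action.START_HA
    return Action.CLUSTER_ALERT
```

With a gateway and one peer reachable, decide_action returns Action.START_HA 
for the unreachable host.  Note that in this sketch a cluster with no other 
hosts never triggers HA, since there is no peer to confirm that the cluster 
network is healthy.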

The only time I could see this causing an issue is if the networking for Host A 
goes down, HA migrates VMs to Host B, then Host A's networking comes back up 
with running VMs.  I don't see this being a very likely scenario though.

A short term solution would be to at least trigger some sort of alert/e-mail 
when the host status cannot be determined.  That way manual intervention can be 
started much more quickly.  Right now a host can be offline indefinitely 
without any notice.
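
A rough Python sketch of that short-term alert, again purely illustrative: 
the UNDETERMINED label, the alert_on_undetermined function and the send_alert 
callback are assumptions for the example and do not correspond to anything in 
CloudStack.  It alerts once when a host first enters the undetermined state, 
rather than on every check.

```python
from typing import Callable, Dict

# Hypothetical label standing in for "unable to determine state".
UNDETERMINED = "undetermined"


def alert_on_undetermined(
    previous: Dict[str, str],
    current: Dict[str, str],
    send_alert: Callable[[str], None],
) -> None:
    """Call send_alert for each host that has just become undetermined.

    previous and current map host names to their state at the last check
    and at this check; send_alert could wrap e-mail or any other channel.
    """
    for host, state in current.items():
        if state == UNDETERMINED and previous.get(host) != UNDETERMINED:
            send_alert(f"Host {host}: status cannot be determined, manual check needed")
```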
                
> No HA actions are performed when a KVM host goes offline
> --------------------------------------------------------
>
>                 Key: CLOUDSTACK-3535
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Hypervisor Controller, KVM, Management Server
>    Affects Versions: 4.1.0, Future
>         Environment: KVM (CentOS 6.3) with CloudStack 4.1
>            Reporter: Paul Angus
>
> If a KVM host 'goes down', CloudStack does not perform HA for instances which 
> are marked as HA enabled on that host (including system VMs)
> CloudStack does not show the host as disconnected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
