[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954834#comment-14954834
 ] 

Simon Weller commented on CLOUDSTACK-8943:
------------------------------------------

In my opinion, you have to be careful with quorum-style functionality on the 
host itself. It can make the hosts overly complicated, especially when you have 
a lot of hosts within a cluster. 
If you've built your storage correctly, it should be highly unlikely that you 
lose your storage (unless you're running NFS with no HA functionality). And if 
you do, that becomes a cluster problem assuming you've dedicated storage just 
to that cluster. 
We've been running quorums using Red Hat Cluster Suite for years on top of CLVM 
SAN-backed storage, and it has proven very difficult to manage as our 
infrastructure has grown. You really want to keep your hosts as simple as 
possible.
Now, if the CS agent were leveraged to provide a deeper view into what was going 
on with the host (and the attached storage), you could probably gather enough 
information to make a determination about what should be done with a given host 
that appeared to be misbehaving. You could then fence it remotely using IPMI, 
which is what I was indicating above. That would simulate the principles of a 
quorum without having to add a complicated clustering layer onto the hosts.
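As a rough illustration of that kind of out-of-band fence (a sketch only: the
BMC address, credentials and the follow-up reschedule step are hypothetical and
not CloudStack agent APIs; it just drives ipmitool from the management side):

    # Hypothetical sketch: fence a misbehaving host out-of-band via IPMI,
    # assuming ipmitool is installed and the host's BMC address/credentials
    # are known to the management side.
    import subprocess

    def ipmi_fence(bmc_host, user, password):
        """Power a host off through its BMC so its VMs can be restarted elsewhere."""
        base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password]
        # Hard fence: power the host off.
        subprocess.run(base + ["chassis", "power", "off"], check=True)
        # Confirm the fence actually took effect before declaring the host dead.
        status = subprocess.run(base + ["chassis", "power", "status"],
                                capture_output=True, text=True, check=True)
        return "off" in status.stdout.lower()

    # Only act on the host's VMs once the fence is confirmed, e.g.:
    # if ipmi_fence("10.0.0.42", "admin", "secret"):
    #     reschedule_vms_from("host-42")   # hypothetical follow-up step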


> KVM HA is broken, let's fix it
> ------------------------------
>
>                 Key: CLOUDSTACK-8943
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>         Environment: Linux distros with KVM/libvirt
>            Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS-based heartbeat file, and it can 
> often fail whenever this network share becomes slow, causing the 
> hypervisors to reboot.
> This can be particularly annoying when you have other kinds of primary 
> storage in place which are working fine (people running Ceph etc.).
> Having to wait for the affected HV which triggered this to come back and 
> declare it's not running VMs is a bad idea; this HV could require hours or 
> days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other 
> hypervisors doing it?
> Let's discuss, test, implement. :)
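
For reference, a minimal sketch of the heartbeat pattern the issue describes,
assuming each host periodically writes a timestamp to a file on the shared NFS
primary storage and a checker treats a stale timestamp as a dead host (the
path, file name and timeout below are illustrative, not CloudStack's actual
heartbeat script):

    import os, time

    HEARTBEAT_FILE = "/mnt/primary/hb-host-01"   # hypothetical path on the NFS share
    TIMEOUT = 60                                  # seconds before the host is presumed dead

    def write_heartbeat():
        # Runs on the hypervisor: touch the heartbeat file on the NFS mount.
        # If the NFS share stalls, this write blocks and the timestamp goes stale.
        with open(HEARTBEAT_FILE, "w") as f:
            f.write(str(int(time.time())))

    def host_looks_dead():
        # Runs on the checking side: a stale timestamp is indistinguishable from
        # a slow share, which is why a slow NFS mount can trigger a reboot/fence.
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
        except OSError:
            return True
        return age > TIMEOUT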



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)