[ https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954787#comment-14954787 ]
Ronald van Zantvoort commented on CLOUDSTACK-8943:
--------------------------------------------------
In clusters, there is such a thing as 'quorum'.
When the majority of the hypervisors believe the NFS share (or any storage, for
that matter) to be unreachable, there's a pretty good chance that the storage
itself is down, rather than that something is wrong with all of those
hypervisors.
Therefore, the proper way to fix this is to introduce this concept of 'quorum',
so that CloudStack can either fence a single host that is unable to connect, or
declare a problem with the storage.
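
To make the idea concrete, here is a minimal sketch of such a quorum check in
Python. It is not CloudStack code: the names (Verdict, assess_storage) and the
shape of the per-host reports are assumptions made up for illustration. It also
assumes that a tie (exactly half of the hosts failing) does not count as a
majority, which a real implementation might well want to handle differently.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"            # every host can reach the storage
    FENCE_HOSTS = "fence_hosts"    # a minority of hosts cannot reach it
    STORAGE_DOWN = "storage_down"  # a majority of hosts cannot reach it


def assess_storage(reports):
    """Decide between fencing hosts and declaring a storage problem.

    `reports` is a hypothetical mapping of host name -> True if that host
    can currently reach the storage, False otherwise.
    Returns (verdict, hosts_to_fence).
    """
    if not reports:
        raise ValueError("no host reports to evaluate")

    unreachable = sorted(host for host, ok in reports.items() if not ok)
    if not unreachable:
        return Verdict.HEALTHY, []

    # Strict majority: more than half of the reporting hosts have lost the
    # storage, so the storage is the likelier culprit and nobody is fenced.
    if len(unreachable) * 2 > len(reports):
        return Verdict.STORAGE_DOWN, []

    # Assumption: a tie or a minority is treated as a host-side fault.
    return Verdict.FENCE_HOSTS, unreachable


if __name__ == "__main__":
    print(assess_storage({"kvm1": True, "kvm2": True, "kvm3": False}))
    print(assess_storage({"kvm1": False, "kvm2": False, "kvm3": True}))
```

With three hosts, one failing report leads to fencing that single host, while
two failing reports are taken as a storage problem and no host is rebooted.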
> KVM HA is broken, let's fix it
> ------------------------------
>
> Key: CLOUDSTACK-8943
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Environment: Linux distros with KVM/libvirt
> Reporter: Nux
>
> Currently, KVM HA works by monitoring an NFS-based heartbeat file, and it can
> often fail whenever this network share becomes slower, causing the
> hypervisors to reboot.
> This can be particularly annoying when you have different kinds of primary
> storage in place that are working fine (people running Ceph etc.).
> Having to wait for the affected HV which triggered this to come back and
> declare it's not running VMs is a bad idea; this HV could require hours or
> days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other
> hypervisors doing it?
> Let's discuss, test, implement. :)