[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953017#comment-14953017
 ] 

Simon Weller commented on CLOUDSTACK-8943:
------------------------------------------

Perhaps one of the easiest ways to deal with this would be to introduce IPMI 
functionality into CloudStack, so that a KVM host could be fenced via an 
out-of-band IPMI interface. Upon successful fencing, CS MGMT could mark the 
host as disabled. I know that deleting a host is enough to force CS MGMT to 
attempt to restart the affected VMs on other hosts, but I'm not sure whether 
disabling a host currently does the same.
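
To make the proposed flow concrete, here is a rough sketch of "fence over IPMI, 
then disable the host". It is only an illustration under stated assumptions, 
not existing CloudStack code: it assumes the ipmitool CLI is installed on the 
management server and that the host's BMC is reachable over IPMI-over-LAN. The 
names fence_host, handle_unresponsive_host and mark_disabled are hypothetical; 
mark_disabled stands in for whatever CS MGMT would call to disable a host.

```python
import subprocess
import time
from typing import Callable, List

# A runner takes ipmitool arguments and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_ipmitool(args: List[str]) -> str:
    """Invoke the ipmitool binary; raises on a non-zero exit or a timeout."""
    result = subprocess.run(
        ["ipmitool", *args],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout


def fence_host(bmc_addr: str, user: str, password_file: str,
               runner: Runner = run_ipmitool,
               attempts: int = 5, delay: float = 2.0) -> bool:
    """Power the host off via its BMC and confirm that it is really off.

    Hypothetical helper: returns True only once the BMC reports power off.
    """
    base = ["-I", "lanplus", "-H", bmc_addr, "-U", user, "-f", password_file]
    runner(base + ["chassis", "power", "off"])
    # The power state may take a moment to change, so poll a few times.
    for _ in range(attempts):
        status = runner(base + ["chassis", "power", "status"])
        if "power is off" in status.lower():
            return True
        time.sleep(delay)
    return False


def handle_unresponsive_host(host_id: str,
                             fence: Callable[[], bool],
                             mark_disabled: Callable[[str], None]) -> bool:
    """Disable the host only after fencing has been confirmed.

    mark_disabled is a placeholder for the management-server call.
    """
    if not fence():
        return False  # fencing unconfirmed: leave the host state untouched
    mark_disabled(host_id)
    return True
```

The point of checking the power status before disabling is that VMs should 
only be restarted elsewhere once the original host is known to be off; 
otherwise two copies of a VM could end up writing to the same disk.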

Other considerations will need to be addressed as well, especially around 
storage locking (e.g. Ceph).

> KVM HA is broken, let's fix it
> ------------------------------
>
>                 Key: CLOUDSTACK-8943
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>         Environment: Linux distros with KVM/libvirt
>            Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS-based heartbeat file, and it can 
> often fail whenever this network share becomes slow, causing the 
> hypervisors to reboot.
> This can be particularly annoying when you have different kinds of primary 
> storages in place which are working fine (people running CEPH etc).
> Having to wait for the affected HV which triggered this to come back and 
> declare it's not running VMs is a bad idea; this HV could require hours or 
> days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other 
> hypervisors doing it?
> Let's discuss, test, implement. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
