[
https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963942#comment-14963942
]
Marcus Sorensen commented on CLOUDSTACK-8943:
---------------------------------------------
Ronald, thanks for reiterating my points (I guess? I'm not sure if you're
quoting me from another thread, but that's the hard stance I've continually
taken on this point: we need fencing if we want to start VMs elsewhere).
I think in general CloudStack already has a fair mechanism for detecting if a
hypervisor has issues (the investigator mechanism, where the mgmt server asks
cluster members), but it simply sets the hypervisor to 'Alert' state, rather
than doing something that will allow the hypervisor's VMs to start somewhere
else. Since the cluster is not autonomous (that is, it's not capable of
starting or migrating VMs on its own) and the locks for doing so are managed
by the management server, I pictured the management server orchestrating IPMI
actions. This is probably safer than granting hypervisors access to their own
IPMI network, and provides an easier access point if we later want to abstract
this into a proxy or similar, as mentioned.
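To make that concrete, the fence action itself could be as simple as the
management server shelling out to ipmitool. A rough sketch (the class and
method names here are made up for illustration, not existing CloudStack code):

    import java.io.IOException;

    public class IpmiFencer {
        /**
         * Power-cycles a host via its BMC using ipmitool, which would need to
         * be installed on the management server. Returns true only if
         * ipmitool exits cleanly, i.e. the BMC acknowledged the reset.
         */
        public static boolean resetHost(String bmcAddress, String user, String password)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder(
                    "ipmitool", "-I", "lanplus",
                    "-H", bmcAddress, "-U", user, "-P", password,
                    "chassis", "power", "reset")
                    .inheritIO()
                    .start();
            return p.waitFor() == 0;
        }
    }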
With all of this in mind, a basic first step would probably be a cluster-level
boolean setting (hypervisor.reset.on.alert?) that triggers an IPMI reset when
a hypervisor goes into Alert state. After a successful reset, all VMs on that
hypervisor would be set to state 'Stopped', and HA would just kick in and
start them elsewhere. The hypervisor info would also have to be extended to
include the IPMI details for each hypervisor.
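The Alert handler could then look something like this; again just a sketch
reusing the IpmiFencer above, where the Host/VmStore types and the config
plumbing are placeholders rather than real CloudStack classes:

    import java.util.List;

    public class AlertStateHandler {

        // Placeholder types standing in for the real host and VM tables.
        interface Host { long id(); String ipmiAddress(); String ipmiUser(); String ipmiPassword(); }
        interface VmStore { List<Long> runningVmIds(long hostId); void markStopped(long vmId); }

        private final VmStore vms;
        private final boolean resetOnAlert; // the cluster-level hypervisor.reset.on.alert value

        AlertStateHandler(VmStore vms, boolean resetOnAlert) {
            this.vms = vms;
            this.resetOnAlert = resetOnAlert;
        }

        void onHostAlert(Host host) throws Exception {
            if (!resetOnAlert) {
                return; // behave exactly as today: host goes to Alert, nothing is fenced
            }
            // Fence first; only touch VM state if the BMC acknowledged the reset,
            // otherwise the same VM could end up running on two hosts at once.
            boolean fenced = IpmiFencer.resetHost(
                    host.ipmiAddress(), host.ipmiUser(), host.ipmiPassword());
            if (fenced) {
                for (long vmId : vms.runningVmIds(host.id())) {
                    vms.markStopped(vmId); // existing HA logic will now start it elsewhere
                }
            }
        }
    }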
> KVM HA is broken, let's fix it
> ------------------------------
>
> Key: CLOUDSTACK-8943
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Environment: Linux distros with KVM/libvirt
> Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS-based heartbeat file, and it can
> often fail whenever this network share becomes slow, causing the hypervisors
> to reboot.
> This can be particularly annoying when you have other kinds of primary
> storage in place which are working fine (people running Ceph etc).
> Having to wait for the affected HV which triggered this to come back and
> declare that it's not running VMs is a bad idea; this HV could require hours
> or days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other
> hypervisors doing it?
> Let's discuss, test, implement. :)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)