[
https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963942#comment-14963942
]
Marcus Sorensen commented on CLOUDSTACK-8943:
---------------------------------------------
Ronald, thanks for reiterating my points (I guess? I'm not sure if you're
quoting me from another thread, but that's the hard stance I've continually
taken on this point: we need fencing if we want to start VMs elsewhere).
I think in general CloudStack already has a fair mechanism for detecting if a
hypervisor has issues (the investigator mechanism, where the mgmt server asks
cluster members), but it simply sets the hypervisor to 'Alert' state, rather
than doing something that will allow the hypervisor's VMs to start somewhere
else. Since the cluster is not autonomous (that is, it's not capable of
starting or migrating VMs on its own) and the locks for doing so are managed
by the management server, I pictured the management server orchestrating IPMI
actions. This is probably safer than granting hypervisors access to their own
IPMI network, and provides an easier access point if we later want to abstract
this into a proxy or similar, as mentioned.
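To make that concrete, the fence action itself could be as simple as the
management server shelling out to ipmitool. A rough sketch (the class and
method names here are made up for illustration, not existing CloudStack code):

    import java.io.IOException;

    public class IpmiFencer {
        /**
         * Power-cycles a host via its BMC using ipmitool, which would need to
         * be installed on the management server. Returns true only if
         * ipmitool exits cleanly, i.e. the BMC acknowledged the reset.
         */
        public static boolean resetHost(String bmcAddress, String user, String password)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder(
                    "ipmitool", "-I", "lanplus",
                    "-H", bmcAddress, "-U", user, "-P", password,
                    "chassis", "power", "reset")
                    .inheritIO()
                    .start();
            return p.waitFor() == 0;
        }
    }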
With all of this in mind, a basic first step would probably be a cluster-level
boolean setting (hypervisor.reset.on.alert?) that triggers an IPMI reset when
a hypervisor goes into Alert state. After a successful reset, all VMs on that
hypervisor would be set to state 'Stopped', and HA would just kick in and
start them elsewhere. The hypervisor info would also have to be extended to
include the IPMI details for each hypervisor.
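The Alert handler could then look something like this; again just a sketch
reusing the IpmiFencer above, where the Host/VmStore types and the config
plumbing are placeholders rather than real CloudStack classes:

    import java.util.List;

    public class AlertStateHandler {

        // Placeholder types standing in for the real host and VM tables.
        interface Host { long id(); String ipmiAddress(); String ipmiUser(); String ipmiPassword(); }
        interface VmStore { List<Long> runningVmIds(long hostId); void markStopped(long vmId); }

        private final VmStore vms;
        private final boolean resetOnAlert; // the cluster-level hypervisor.reset.on.alert value

        AlertStateHandler(VmStore vms, boolean resetOnAlert) {
            this.vms = vms;
            this.resetOnAlert = resetOnAlert;
        }

        void onHostAlert(Host host) throws Exception {
            if (!resetOnAlert) {
                return; // behave exactly as today: host goes to Alert, nothing is fenced
            }
            // Fence first; only touch VM state if the BMC acknowledged the reset,
            // otherwise the same VM could end up running on two hosts at once.
            boolean fenced = IpmiFencer.resetHost(
                    host.ipmiAddress(), host.ipmiUser(), host.ipmiPassword());
            if (fenced) {
                for (long vmId : vms.runningVmIds(host.id())) {
                    vms.markStopped(vmId); // existing HA logic will now start it elsewhere
                }
            }
        }
    }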
> KVM HA is broken, let's fix it
> ------------------------------
>
> Key: CLOUDSTACK-8943
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Environment: Linux distros with KVM/libvirt
> Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS-based heartbeat file, and it can
> often fail whenever this network share becomes slow, causing the hypervisors
> to reboot.
> This can be particularly annoying when you have other kinds of primary
> storage in place which are working fine (people running Ceph etc).
> Having to wait for the affected HV which triggered this to come back and
> declare that it's not running VMs is a bad idea; this HV could require hours
> or days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other
> hypervisors doing it?
> Let's discuss, test, implement. :)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)