[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963016#comment-14963016
 ] 

Ronald van Zantvoort commented on CLOUDSTACK-8943:
--------------------------------------------------

[[email protected]]: Thanks for the design document. I can't 
comment in Confluence, so here goes:

* When to fence; [~sweller]: Of course you're right that it should be highly 
unlikely for your storage to completely disappear from the cluster. Be that as 
it may, as you yourself note, first of all, if you're using NFS without HA that 
likelihood increases manyfold. Secondly, defining it as an unlikely disastrous 
event is no reason not to take it into account; turning it into a catastrophic 
event by 'fencing' all affected hypervisors will not serve anyone, as it would 
be both unexpected and unwelcome. 
* The entire concept of fencing exists to absolutely ensure state, specifically 
the state of the block devices and their data. [~shadowsor]: For that same 
reason it's not reasonable to 'just assume' the VMs are gone. There are a ton 
of failure domains that could cause an agent to disconnect from the manager 
while the same VMs keep running, and there's nothing stopping CloudStack from 
starting the same VM twice on the same block devices, with disastrous results. 
That's why you *need* to *know* the VMs are *very definitely* not running 
anymore, which is exactly what fencing is supposed to do.
* For this, IPMI fencing is a nice and very commonly used option; it absolutely 
ensures that a hypervisor has died, and with it the VMs it was running. It 
will, however, not fix the case of mass-rebooting hypervisors (and will quite 
likely make that even more of an adventure if not addressed properly).


Now, with all that in mind, I'd like to make the following comments regarding 
[[email protected]]'s design.

* First, on the IPMI implementation: there is IMHO no need to define IPMI as 
(Executable, Start, Stop, Reboot, Blink, Test). IPMI is a protocol, and all of 
these are standard commands. For example, the venerable `ipmitool` gives you 
`chassis power on`, `status`, `off`, `reset` etc., which will work on *any* 
IPMI device; only the authentication details (user, password, protocol) differ. 
There's bound to be a library that does this without having to resort to 
(possibly numerous) different (versions of) external binaries. See the sketch 
below.
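
To illustrate (a minimal, hypothetical sketch, not a proposal for the actual 
implementation): a thin wrapper that shells out to `ipmitool` needs nothing per 
device beyond the connection/auth details; every command it issues is a 
standard one. Class and method names here are made up, and a real 
implementation would preferably use a native IPMI library instead of an 
external binary:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper: the "chassis power" subcommands are standard IPMI,
// so only the per-device connection details (host, user, password) vary.
public class IpmiFence {
    private final String host;
    private final String user;
    private final String pass;

    public IpmiFence(String host, String user, String pass) {
        this.host = host;
        this.user = user;
        this.pass = pass;
    }

    private int chassisPower(String action) throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", pass,
                "chassis", "power", action));
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor(); // ipmitool exits non-zero on failure
    }

    public boolean status()   throws Exception { return chassisPower("status") == 0; }
    public boolean powerOff() throws Exception { return chassisPower("off") == 0; }
    public boolean reset()    throws Exception { return chassisPower("reset") == 0; }
}
{code}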

* Secondly, you're assuming that hypervisors can access the IPMIs of their 
cluster/pod peers. Although I'm not against this assumption per se, I'm also 
not convinced we're serving everybody by forcing it to be true; some kind of 
IPMI agent/proxy comes to mind, or even relegating the task back to the manager 
or a SystemVM. Also bear in mind that you need access to those IPMIs to ensure 
cluster functionality, so a failure domain should go into maintenance state if 
any of its fence devices can't be reached (see the sketch below).
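
Something along these lines (purely illustrative; FenceDevice and the state 
names are placeholders for whatever component ends up owning the IPMI 
connections, be it a peer hypervisor, an agent/proxy, the manager or a 
SystemVM):

{code:java}
import java.util.List;

// Illustrative only: before a failure domain is allowed to make fencing
// decisions, verify that every peer's fence device (IPMI) actually answers;
// if any does not, the domain should stop fencing and flag itself for
// maintenance instead.
public class FenceReachability {

    interface FenceDevice {
        String hostId();
        boolean respondsToStatus(); // e.g. "ipmitool ... chassis power status" succeeds
    }

    enum DomainState { OPERATIONAL, MAINTENANCE }

    static DomainState evaluate(List<FenceDevice> fenceDevices) {
        boolean allReachable = fenceDevices.stream()
                .allMatch(FenceDevice::respondsToStatus);
        return allReachable ? DomainState.OPERATIONAL : DomainState.MAINTENANCE;
    }
}
{code}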

* Thirdly, your proposed testing algorithm needs more discussion; after all, it 
goes directly to the fundamental reasons for *why* to fence a host, and that's 
a lot more than just 'these disks still get writes'. In fact, by the time 
you're checking this, you're probably already assuming something is very wrong 
with the hypervisor, so why not just fence it then? The decision to fence 
should lie with the first notification that something is (very) wrong with the 
hypervisor, and only limited attempts should be made to get it out of that 
state (say it can't reach its storage and that gets you your HA actions; why 
check the disks first? Try to get the storage back up, say, 3 times or for 90 
seconds or so, then fence the fucker and HA the VMs immediately after 
confirmation). See the sketch below.
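
A rough sketch of that flow, under the assumption that the agent/manager 
exposes something like the three operations below (all names are hypothetical, 
and the 3 attempts with 30-second spacing merely mirror the "3 times or ~90 
seconds" suggestion):

{code:java}
// Illustrative flow only; storageBackUp(), fenceHost() and restartVmsElsewhere()
// are placeholders for whatever the agent/manager actually exposes.
public class FenceAfterRetries {

    interface HostOps {
        boolean storageBackUp(String hostId);      // try to remount/reconnect the storage
        boolean fenceHost(String hostId);          // e.g. IPMI power off, must be confirmed
        void restartVmsElsewhere(String hostId);   // HA the VMs only after confirmation
    }

    static void handleStorageFailure(HostOps ops, String hostId) throws InterruptedException {
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (ops.storageBackUp(hostId)) {
                return; // storage recovered, nothing to fence
            }
            Thread.sleep(30_000); // roughly 90 seconds across the three attempts
        }
        // Only after the fence is *confirmed* do we HA the VMs, so the same block
        // devices can never end up behind two running copies of the same VM.
        if (ops.fenceHost(hostId)) {
            ops.restartVmsElsewhere(hostId);
        }
    }
}
{code}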

* Finally, as mentioned, you're not solving the 'oh look, my storage is gone, 
let's fence' times-N problem; in the case of a failing NFS:
  * Every host will start IPMI-resetting every other hypervisor; by then 
there's a good chance every hypervisor in all connected clusters is rebooting, 
leaving a state where there are no hypervisors left in the cluster to fence 
others; that in turn should drop the cluster into maintenance state, which 
will set off even more bells & whistles.
  * They'll come back, find the NFS still gone, and continue resetting each 
other like there's no tomorrow.
  * Support staff already panicking over the NFS/network outage now have to 
deal with entire clusters of hypervisors in perpetual reboot, as well as 
clusters which are completely unreachable because there's nobody left to check 
state; all this while the outage might simply require the revert of some 
inadvertent network ACL snafu.
Although I well understand [~sweller]'s concerns regarding agent complexity in 
this regard, quorum is the standard way of solving that problem. On the other 
hand, once the Agents start talking to each other and the Manager over some 
standard messaging API/bus, this problem might well be solved for you; 
adopting, say, Gossip or Paxos or any other clustering/quorum protocol 
shouldn't be that hard considering the amount of Java software out there 
already doing just that (a rough sketch of such a quorum gate follows below). 
Another idea would be to introduce some other kind of storage monitoring, for 
example by a SystemVM or something.
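
To make the quorum idea concrete (PeerView is a hypothetical stand-in for 
whatever messaging bus or clustering library ends up carrying the votes; the 
point is only that a strict majority of the cluster must agree before anyone 
gets fenced):

{code:java}
import java.util.List;

// Illustrative quorum gate: a host that wants to fence a peer first asks the
// other agents whether they also consider that peer dead; only a strict
// majority of the whole cluster authorises the fence, so a partitioned
// minority can never start a fencing storm on its own.
public class QuorumFenceGate {

    interface PeerView {
        boolean peerSaysHostIsDead(String voterId, String suspectId); // one vote per peer
    }

    static boolean mayFence(PeerView view, List<String> voters, String suspectId) {
        long yes = voters.stream()
                .filter(v -> view.peerSaysHostIsDead(v, suspectId))
                .count();
        return yes > voters.size() / 2; // strict majority of all voters, not just responders
    }
}
{code}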

If you insist on the 'clusters fence themselves' paradigm, you could maybe also 
introduce the constraint that a node is only allowed to fence others if it is 
itself healthy; ergo, if it doesn't have all of its own storage available, it 
doesn't get to fence others whose storage isn't available. See the sketch 
below.
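
A minimal sketch of that constraint, assuming the agent can enumerate its own 
primary storage pools (StoragePool and isHealthy() are illustrative stand-ins):

{code:java}
import java.util.Collection;

// Illustrative guard: a node that has itself lost storage is disqualified from
// fencing others for the same symptom; it should raise an alert or go into
// maintenance instead.
public class SelfHealthGuard {

    interface StoragePool {
        boolean isHealthy();
    }

    static boolean allowedToFence(Collection<StoragePool> myPrimaryPools) {
        return myPrimaryPools.stream().allMatch(StoragePool::isHealthy);
    }
}
{code}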


> KVM HA is broken, let's fix it
> ------------------------------
>
>                 Key: CLOUDSTACK-8943
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>         Environment: Linux distros with KVM/libvirt
>            Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS based heartbeat file and it can 
> often fail whenever this network share becomes slower, causing the 
> hypervisors to reboot.
> This can be particularly annoying when you have different kinds of primary 
> storages in place which are working fine (people running CEPH etc).
> Having to wait for the affected HV which triggered this to come back and 
> declare it's not running VMs is a bad idea; this HV could require hours or 
> days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other 
> hypervisors doing it?
> Let's discuss, test, implement. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
