We now have the ComponentFail test in CTS. Thanks Lars for getting it going!

In the process, it's exposing kinds of problems that we hadn't been looking for before. A couple of examples can be found here:

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

The question that comes up is this:

For problems that should "never" happen, like the death of one of our core/key processes, is an immediate reboot of the machine the right recovery technique?

The advantages of such a choice include:
 It is fast
 It will invoke recovery paths that we exercise a lot in testing
 It is MUCH simpler than trying to recover from all these cases,
        and therefore almost certainly more reliable

The disadvantages of such a choice include:
 It is crude, and very annoying
 It probably shouldn't be invoked for single-node clusters (?)
 It could be criticized as being lazy
 It shouldn't be invoked if there is another simple and correct method
 Continual rebooting becomes a possibility...

We do not have a project-wide policy of doing this; what we have is a few places where we do it.

I propose that we make a uniform policy decision for the project - specifically, that we use ungraceful reboots as our recovery method when "key" processes die (for example: CCM, heartbeat, CIB, CRM). It should work even when people haven't configured watchdogs or explicitly defined any STONITH devices, and independently of quorum policies. AFAIK there's no technical reason not to do so, and it seems like the right choice.
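
To make the proposal concrete, here is a minimal sketch of what such a policy check might look like. It is only an illustration, not code from heartbeat: the function name, its parameters, and the thresholds are all hypothetical, and the single-node exemption and the reboot-loop guard are assumptions taken from the list of disadvantages above, not decided policy.

```python
# Hypothetical sketch; none of these names are heartbeat's actual API.
KEY_PROCESSES = {"ccm", "heartbeat", "cib", "crm"}


def should_reboot(process, cluster_size, recent_reboots, now,
                  max_reboots=3, window_secs=600):
    """Return True if the death of `process` should trigger an immediate reboot.

    recent_reboots: timestamps (in seconds) of earlier policy-driven reboots.
    now:            current timestamp, in the same units.
    """
    # Only the death of a "key" process falls under this policy.
    if process.lower() not in KEY_PROCESSES:
        return False
    # Assumption: skip the reboot on single-node clusters.
    if cluster_size <= 1:
        return False
    # Assumption: guard against continual rebooting by counting
    # the reboots that happened inside a sliding time window.
    recent = [t for t in recent_reboots if now - t < window_secs]
    if len(recent) >= max_reboots:
        return False
    return True
```

With the (arbitrary) defaults above, a node that has already rebooted three times in the last ten minutes would stop rebooting and would need some other recovery path. How the reboot itself is triggered is deliberately left out of the sketch.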

My inclination is to think that this is a good approach to take for problems that in our best-guess judgment "shouldn't happen".


I'm bringing this to both lists, so that we can hear comments both from
developers and users.


Comments please...

--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/