Not always. The case I have encountered (live) doesn't relate to HB component failure per se, but is nevertheless destructive.
With an eDirectory load (and other database-backed software with large or lazily flushed write buffers would be similarly affected, IMHO), a hard reset of a node has a high likelihood of corrupting the database. In some cases this is no less destructive than allowing concurrent access to, say, an ext3 filesystem...

I have been pondering for a while the possibility of using a disk-based heartbeat to block STONITH while the STONITH target is still writing its disk heartbeat, which would prevent data damage in such cases. In addition, I have been thinking of complementing this mechanism with a disk-based "STONITH" (otherwise known as a "poison pill"...), so that the unreachable node may (if things aren't too badly broken) take its resources down and stop its disk heartbeat, which would then allow the rest of the cluster to consider it to have left the cluster safely and migrate its resources. Not quite sure how much of a fundamental change this would be, though...

Yan

PS: I must admit this was fuelled by Novell's promise to release the NCS SBD code under the GPL, but unfortunately they are either late or have decided not to proceed. A shame - I was hoping to take some time this month to hack around with this.

Alan Robertson wrote:
> We now have the ComponentFail test in CTS. Thanks, Lars, for getting
> it going!
>
> And, in the process, it's showing up some kinds of problems that we
> hadn't been looking for before. A couple of examples of such problems
> can be found here:
>
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>
> The question that comes up is this:
>
> For problems that should "never" happen, like the death of one of our
> core/key processes, is an immediate reboot of the machine the right
> recovery technique?
>
> The advantages of such a choice include:
>  - It is fast.
>  - It will invoke recovery paths that we exercise a lot in testing.
>  - It is MUCH simpler than trying to recover from all these cases,
>    and therefore almost certainly more reliable.
>
> The disadvantages of such a choice include:
>  - It is crude, and very annoying.
>  - It probably shouldn't be invoked for single-node clusters (?).
>  - It could be criticized as being lazy.
>  - It shouldn't be invoked if there is another simple and correct
>    method.
>  - Continual rebooting becomes a possibility...
>
> We do not have a policy of doing this throughout the project; what we
> have is a few places where we do it.
>
> I propose that we consider making a uniform policy decision for the
> project - and specifically decide to use ungraceful reboots as our
> recovery method when "key" processes die (for example: CCM, heartbeat,
> CIB, CRM). It should work in those cases where people don't configure
> watchdogs or explicitly define any STONITH devices, and also
> independently of quorum policies - because AFAIK it seems like the
> right choice, and there's no technical reason not to do so.
>
> My inclination is to think that this is a good approach to take for
> problems that in our best-guess judgment "shouldn't happen".
>
> I'm bringing this to both lists, so that we can hear comments both
> from developers and users.
>
> Comments please...
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
