Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Alan Robertson Tue, 06 Nov 2007 14:18:40 -0800

Yan Fitterer wrote:

Not always. The case I have encountered (live) doesn't relate to HB
component failure per se, but is nevertheless destructive.


With an eDirectory load (and other database-backed software with large
or lazily flushed write buffers would be similarly affected, IMHO), a
hard reset of a node has a high likelihood of corrupting the database.
This is in some cases no less destructive than allowing concurrent
access to, say, an ext3 filesystem...

If your software cannot withstand a crash, then it cannot be made highly-available - end of story. Crashes will happen. Be prepared.

This is a very serious application bug. I would consider this a priority one bug if it were my application.

I have been pondering for a while the possibility of using some
disk-based heartbeat to block STONITH, in cases where the STONITH target
is still writing its disk heartbeat. This would in this case prevent
data damage.

In addition, I have been thinking of complementing this mechanism with a
disk-based "STONITH" (otherwise known as "poison pill"...) so that the
unreachable node may (if things aren't too badly broken) take its
resources down, and stop the disk heartbeat, which would then allow the
rest of the cluster to consider it having left the cluster safely, and
migrate the resources.

Not quite sure how much of a fundamental change this would be though...
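
To make the proposal above concrete, here is a minimal sketch of the decision a surviving node might make. It is only an illustration under stated assumptions, not anything that exists in Heartbeat: the slot layout, the function names, and the explicit "left cleanly" flag (used so that a stopped heartbeat is not mistaken for a clean shutdown) are all hypothetical, and an ordinary file stands in for the shared disk area.

```python
import os
import struct

# Hypothetical slot layout, one slot per node on shared storage:
#   bytes 0-7  heartbeat timestamp, written only by the owning node
#   byte  8    "left cleanly" flag, written only by the owning node
#   byte  9    poison pill flag, written only by peers
HEARTBEAT = struct.Struct("<d")
SLOT_SIZE = 10


def _write_at(path, offset, data):
    """Write data at a fixed offset and force it out to the disk."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def init_slot(path):
    """Create a zeroed slot (no heartbeat yet, no flags set)."""
    with open(path, "wb") as f:
        f.write(bytes(SLOT_SIZE))


def beat(path, now):
    """Owning node: record that it is still alive and writing."""
    _write_at(path, 0, HEARTBEAT.pack(now))


def mark_left(path):
    """Owning node: resources are stopped, safe to migrate them."""
    _write_at(path, 8, b"\x01")


def send_poison_pill(path):
    """Peer: ask the unreachable node to take its resources down."""
    _write_at(path, 9, b"\x01")


def read_slot(path):
    with open(path, "rb") as f:
        raw = f.read(SLOT_SIZE)
    (stamp,) = HEARTBEAT.unpack(raw[:8])
    return stamp, raw[8] == 1, raw[9] == 1


def next_action(path, now, dead_after):
    """Peer: decide what to do about a node that is unreachable on the network."""
    stamp, left, poisoned = read_slot(path)
    if left:
        return "migrate"            # node stopped its resources itself
    if now - stamp <= dead_after:   # still writing its disk heartbeat
        return "wait" if poisoned else "send_poison_pill"
    return "stonith"                # disk heartbeat is stale too
```

A peer would call next_action() periodically for a node it can no longer reach over the network. STONITH is held back only while the disk heartbeat stays fresh, so a node that has really died is still fenced.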

My suggestion for this would be to implement a full communications plugin module that sends packets through disk areas. If you do this right, then the communications will remain fully up for all purposes. We've had people start this effort in the past, but it's never been finished and all the bugs driven out AFAIK.
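
As a rough illustration of what "sending packets through disk areas" could look like, the sketch below gives each node one fixed-size slot in a shared area and has peers poll it for a new sequence number. This is not the Heartbeat plugin API; the slot size, header layout, and function names are assumptions made for the example, an ordinary file stands in for the shared disk, and a real module would also need protection against torn writes (a checksum, for instance).

```python
import os
import struct

HEADER = struct.Struct("<IH")   # sequence number, payload length
SLOT_SIZE = 512                 # hypothetical: one sector per node


def create_area(path, nodes):
    """Pre-size the shared area with one zeroed slot per node."""
    with open(path, "wb") as f:
        f.write(bytes(SLOT_SIZE * nodes))


def send(path, node_index, seq, payload):
    """Write one packet into the sending node's own slot."""
    if len(payload) > SLOT_SIZE - HEADER.size:
        raise ValueError("payload does not fit in one slot")
    record = (HEADER.pack(seq, len(payload)) + payload).ljust(SLOT_SIZE, b"\0")
    with open(path, "r+b") as f:
        f.seek(node_index * SLOT_SIZE)
        f.write(record)
        f.flush()
        os.fsync(f.fileno())


def receive(path, node_index, last_seq):
    """Poll a peer's slot; return (seq, payload), or (last_seq, None) if nothing new."""
    with open(path, "rb") as f:
        f.seek(node_index * SLOT_SIZE)
        raw = f.read(SLOT_SIZE)
    seq, length = HEADER.unpack_from(raw)
    if seq == last_seq:
        return last_seq, None
    return seq, raw[HEADER.size:HEADER.size + length]
```

Each node writes only to its own slot and reads everyone else's, so no locking between nodes is needed in this simplified form.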


--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
