Yan Fitterer wrote:
Not always. The case I have encountered (live) doesn't relate to HB
component failure per se, but is nevertheless destructive.

With an eDirectory load (and other database-backed software with large
or lazily flushed write buffers would be similarly affected, IMHO), a
hard reset of a node has a high likelihood of corrupting the database.
This is in some cases no less destructive than allowing concurrent
access to, say, an ext3 filesystem...

If your software cannot withstand a crash, then it cannot be made highly-available - end of story. Crashes will happen. Be prepared.

This is a very serious application bug. I would consider this a priority one bug if it were my application.

I have been pondering for a while the possibility of using some
disk-based heartbeat to block STONITH, in cases where the STONITH target
is still writing its disk heartbeat. This would in this case prevent
data damage.

In addition, I have been thinking of complementing this mechanism with a
disk-based "STONITH" (otherwise known as "poison pill"...) so that the
unreachable node may (if things aren't too badly broken) take its
resources down, and stop the disk heartbeat, which would then allow the
rest of the cluster to consider it to have left the cluster safely, and
migrate the resources.

Not quite sure how much of a fundamental change this would be though...
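
For concreteness, the two mechanisms above (a disk heartbeat that blocks STONITH, plus a disk-based poison pill) could be sketched roughly as below. This is only an illustration and not Heartbeat code: the on-disk layout (two 512-byte slots per node), the state values, and every function name here are assumptions made up for the example, and an ordinary file stands in for the shared disk.

```python
import os
import struct

SECTOR = 512                 # assumed size of one on-disk slot
HB = struct.Struct("<dB")    # heartbeat slot: timestamp, state (hypothetical layout)
UNKNOWN, RUNNING, STOPPED = 0, 1, 2


def _write(path, sector, data):
    # Create the backing file on first use, otherwise update it in place.
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(sector * SECTOR)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def _read(path, sector, size):
    if not os.path.exists(path):
        return b""
    with open(path, "rb") as f:
        f.seek(sector * SECTOR)
        return f.read(size)


def write_heartbeat(path, node, now):
    # Only the owning node writes its heartbeat slot (even-numbered slots).
    _write(path, 2 * node, HB.pack(now, RUNNING))


def stop_heartbeat(path, node):
    # Called by a node after it has taken its resources down.
    _write(path, 2 * node, HB.pack(0.0, STOPPED))


def send_poison_pill(path, node):
    # Peers write the target's poison slot (odd-numbered slots).
    _write(path, 2 * node + 1, b"\x01")


def poison_requested(path, node):
    return _read(path, 2 * node + 1, 1) == b"\x01"


def fencing_decision(path, node, now, timeout):
    """Return 'left', 'defer' or 'fence' for an unreachable node."""
    raw = _read(path, 2 * node, HB.size)
    if len(raw) < HB.size:
        return "fence"       # no heartbeat slot was ever written
    stamp, state = HB.unpack(raw)
    if state == STOPPED:
        return "left"        # node shut down cleanly, safe to migrate
    if state == RUNNING and now - stamp < timeout:
        return "defer"       # still writing to disk, so hold off on STONITH
    return "fence"           # heartbeat is stale, a hard reset is acceptable
```

In this sketch the unreachable node would poll poison_requested() for its own slot, stop its resources when the flag appears, and then call stop_heartbeat(); the rest of the cluster would poll fencing_decision() and only reset the node once the result is 'fence'. A real version would need raw or direct I/O on the shared device and some protection against torn writes, neither of which is shown.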

My suggestion for this would be to implement a full communications plugin module that sends packets through disk areas. If you do this right, then the communications will remain fully up for all purposes. We've had people start this effort in the past, but AFAIK it has never been finished, nor have all the bugs been driven out.
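
To illustrate what "packets through disk areas" might look like, here is a minimal sketch. It is not Heartbeat's actual plugin interface; the layout (one 512-byte area per sender/receiver pair, a sequence number and length header) and the names send() and poll() are assumptions invented for this example, with a plain file standing in for the shared disk.

```python
import os
import struct

AREA_SIZE = 512                   # assumed size of one mailbox area
MAX_NODES = 4                     # assumed cluster size for the layout
HEADER = struct.Struct("<IH")     # sequence number, payload length


def _offset(sender, receiver):
    # One dedicated area per ordered (sender, receiver) pair, so each
    # area has exactly one writer.
    return (sender * MAX_NODES + receiver) * AREA_SIZE


def send(path, sender, receiver, seq, payload):
    if len(payload) > AREA_SIZE - HEADER.size:
        raise ValueError("payload too large for one disk area")
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(_offset(sender, receiver))
        f.write(HEADER.pack(seq, len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def poll(path, sender, receiver, last_seq):
    """Return (seq, payload) if a message newer than last_seq exists, else None."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        f.seek(_offset(sender, receiver))
        raw = f.read(AREA_SIZE)
    if len(raw) < HEADER.size:
        return None               # area has never been written
    seq, length = HEADER.unpack_from(raw)
    if seq <= last_seq:
        return None               # nothing new since the last poll
    return seq, raw[HEADER.size:HEADER.size + length]
```

The sender would bump seq (starting from 1) for every packet, and the receiver would poll on a timer and remember the last sequence number it saw. A packet that is overwritten before the receiver polls is simply lost here, which is tolerable for periodic heartbeats but would need handling for anything else.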

--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
