Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Yan Fitterer Wed, 07 Nov 2007 17:24:46 -0800


Alan Robertson wrote:
> Yan Fitterer wrote:
>> Not always. The case I have encountered (live) doesn't relate to HB
>> component failure per se, but is nevertheless destructive.
>>
>> With an eDirectory load (and other database-backed software with large
>> or lazily flushed write buffers would be similarly affected, IMHO), a
>> hard reset of a node has a high likelihood of corrupting the database.
>> This is in some cases no less destructive than allowing concurrent
>> access to, say, an ext3 filesystem...
> 
> If your software cannot withstand a crash, then it cannot be made
> highly-available - end of story.  Crashes will happen.  Be prepared.


This is a fine argument from an engineering perspective, but not much
use from a sysadmin POV. Heartbeat should (and can, and does!) help with
any kind of software. I'm simply pointing out that (for less-than-perfect
software, amongst other reasons) the fewer potential STONITH (hard reset)
cases we have, the better. :) Anything to avoid STONITH, in particular
when a node isn't quite dead from the workload perspective.

>> I have been pondering for a while the possibility of using some
>> disk-based heartbeat to block STONITH, in cases where the STONITH target
>> is still writing its disk heartbeat. This would in this case prevent
>> data damage.
>>
>> In addition, I have been thinking of complementing this mechanism with a
>> disk-based "STONITH" (otherwise known as "poison pill"...) so that the
>> unreachable node may (if things aren't too badly broken) take its
>> resources down, and stop the disk heartbeat, which would then allow the
>> rest of the cluster to consider it having left the cluster safely, and
>> migrate the resources.
>>
>> Not quite sure how much of a fundamental change this would be though...
> 
> My suggestion for this would be to implement a full communications
> plugin module that sends packets through disk areas.  If you do this
> right, then the communications will remain fully up for all purposes.
> We've had people start this effort in the past, but it's never been
> finished and all the bugs driven out AFAIK.

Agreed. I can't make much headway with my other approach(es), and
having thought about it, they're certainly very much inferior to full
disk-based comms. I happen to have a little time on my hands this
month, and an itch to do some hacking.
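
To make "packets through disk areas" a bit more concrete, here is a
minimal sketch of one way it could look. This is not Heartbeat code and
not based on any of the earlier patches: the one-slot-per-node layout,
the 512-byte slot size, the "HBDK" magic and the names write_slot /
read_slot are all assumptions made up for illustration.

```python
import os
import struct
import zlib

# Assumed layout: one fixed-size slot per node on a shared device/file.
SLOT_SIZE = 512
MAGIC = b"HBDK"                      # hypothetical marker for a valid slot
HEADER = struct.Struct("<4sIIH")     # magic, crc32, sequence, payload length
MAX_PAYLOAD = SLOT_SIZE - HEADER.size


def _checksum(seq, payload):
    # CRC covers the sequence number, the length and the payload itself.
    return zlib.crc32(struct.pack("<IH", seq, len(payload)) + payload)


def write_slot(fd, node_index, seq, payload):
    """Write one packet into the slot owned by node_index."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one slot")
    record = HEADER.pack(MAGIC, _checksum(seq, payload), seq, len(payload))
    record += payload
    # Pad to a full slot so each update is a single fixed-size write.
    os.pwrite(fd, record.ljust(SLOT_SIZE, b"\0"), node_index * SLOT_SIZE)
    os.fsync(fd)


def read_slot(fd, node_index):
    """Return (seq, payload) from a slot, or None if empty or damaged."""
    raw = os.pread(fd, SLOT_SIZE, node_index * SLOT_SIZE)
    if len(raw) < HEADER.size:
        return None
    magic, crc, seq, length = HEADER.unpack_from(raw)
    if magic != MAGIC or length > MAX_PAYLOAD:
        return None
    payload = raw[HEADER.size:HEADER.size + length]
    if _checksum(seq, payload) != crc:
        return None                  # torn or corrupted write
    return seq, payload
```

Each node would only ever write its own slot and poll everybody else's,
so no disk locking is needed. On a real shared disk the reads would
also have to bypass the local page cache (O_DIRECT or similar), which
this sketch ignores.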

Can anybody point me to the patch(es) with whatever code we have around
this?

Is anybody else coding on this right now?
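
For reference, the decision logic behind the earlier idea (disk
heartbeat blocks STONITH, poison pill asks the node to stop cleanly)
could be sketched roughly as below. Again, this is only an illustration:
the names DiskView, observe and decide, the 10-second timeout and the
"clean exit" marker are assumptions, not anything that exists in
Heartbeat.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    POISON_PILL = "poison_pill"   # ask the peer, via disk, to stop resources
    WAIT = "wait"                 # pill sent, peer still writing: hold off
    TAKE_OVER = "take_over"       # peer left cleanly, migrate resources
    STONITH = "stonith"           # peer silent on disk too: hard reset


@dataclass
class DiskView:
    last_seq: int        # last disk heartbeat counter seen from the peer
    last_change: float   # local time at which that counter last advanced
    clean_exit: bool     # peer wrote a "resources stopped" marker


def observe(view, seq, clean_exit, now):
    """Fold a fresh read of the peer's disk area into our view of it."""
    if seq != view.last_seq:
        view.last_seq = seq
        view.last_change = now
    view.clean_exit = clean_exit


def decide(view, pill_sent, now, disk_timeout=10.0):
    """What to do about a peer that is unreachable over the network."""
    if view.clean_exit:
        return Action.TAKE_OVER
    if now - view.last_change >= disk_timeout:
        # No disk heartbeat either, so nothing is left to block STONITH.
        return Action.STONITH
    # Peer is still writing its disk heartbeat: do not reset it.
    return Action.WAIT if pill_sent else Action.POISON_PILL
```

decide() would be called only once the network heartbeat to the peer
has been lost. Whether to escalate to STONITH anyway if the peer keeps
writing but never acts on the pill is left open here.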

Thanks
Yan