Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Andrew Beekhof Wed, 07 Nov 2007 01:00:12 -0800


On Nov 7, 2007, at 9:16 AM, matilda matilda wrote:

Alan Robertson <[EMAIL PROTECTED]> 07.11.2007 05:51 >>>
Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the risks
and seem happy enough.
Like lots of cases, people are happy enough until they get burned. Just
like the people with shared storage and no STONITH who are happy enough
until they get bit. And, there are plenty of those folks too - probably
quite a few more.
Hi Andrew, hi Alan,
I'm not playing in your "knowledge league", but at this point I have to agree
with Alan. And this is only from my own experience with HAv2:
a) Someone who gets a slightly more elaborate configuration to run is very, very
happy that it works, meaning the services come up and some manageable resource
failovers are tested. A correct stonith configuration can hardly be tested.
And I'm pretty sure that the necessity to also test stonith behaviour is
not seen appropriately (only my assumption).
My experience with the whole HAv2 cluster is that it sometimes behaves very
differently under heavy load. And under heavy load I'm pretty sure that a ssh stonith
device would NOT work reliably.

I'm not arguing that ssh will work in 100% of cases - but what it will do is make the cluster wait until it does.

What I am also trying to point out is that if the node is in such a state that ssh won't work reliably, then what are the chances that the node will be able to commit suicide a) at all, b) before the cluster has started the services for the second time?

I'd argue that it is exactly these situations where ssh is _better_ than suicide.

b) You have to follow this list for months to get a feeling for how important
a reliably working external stonith device is. This has to be emphasized more
than it currently is in the documentation (IMHO).
I had a long discussion explaining why we have to implement a stonith device
correctly. (By the way, some will remember that there was a thread initiated
by me about external stonith plugins).

Hence a suggestion:
- Emphasize the necessity of an external, properly configured stonith device.


wiki.linux-ha.org :-)

Explain the reasons and scenarios which bring up this requirement. If the
administrators/implementers understand the potential risks they will think
differently about that.
- RA scripts should be tagged so that a failover of a resource managed by this
RA will NEVER happen if stonith is not configured.


I'm pretty sure you get this by setting on_fail=block

And actually if a stop fails and stonith is not enabled you end up blocking anyway.

But that's a different scenario to fencing in response to a node-level failure.

Probably better explained: If (real) stonith is not configured, there is a risk of damaging
resources by starting them twice. This risk applies to certain resources
(RA). In a case where heartbeat does not know for sure if the resource is
(properly) stopped on a node it should NEVER failover such a resource. Just keep
your fingers away from that. In such a case you have to provide a kind of
notification mechanism (probably something different than writing one line of
thousands to a log file) to get an administrator to check by hand.
So, certain RAs have a flag (probably another attribute) saying not to failover
this resource if stonith is not configured for a cluster using this RA.
What do you think? Is this idea totally braindead?
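
The rule proposed above could be sketched roughly as follows. This is only an
illustration of the idea, not how heartbeat or the CRM actually decides: the
needs_stonith flag, the Action names and the recovery_action() function are
all hypothetical, and the behaviour for unflagged resources is an assumption
made here because the proposal only restricts flagged ones.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FAILOVER = "failover"
    FENCE_THEN_FAILOVER = "fence, then failover"
    BLOCK_AND_NOTIFY = "block and notify admin"


@dataclass(frozen=True)
class Resource:
    name: str
    # Hypothetical per-RA flag from the proposal; not an existing attribute.
    needs_stonith: bool


def recovery_action(res: Resource, stonith_configured: bool,
                    stop_confirmed: bool) -> Action:
    """Pick a recovery action for a resource whose node has failed."""
    # The resource is known to be stopped, so starting it elsewhere is safe.
    if stop_confirmed:
        return Action.FAILOVER
    # State unknown, but the old node can be fenced before the restart.
    if stonith_configured:
        return Action.FENCE_THEN_FAILOVER
    # State unknown and no fencing: never start a flagged resource twice.
    if res.needs_stonith:
        return Action.BLOCK_AND_NOTIFY
    # Assumption of this sketch: unflagged resources tolerate a double start.
    return Action.FAILOVER


db = Resource("shared-db", needs_stonith=True)
print(recovery_action(db, stonith_configured=False, stop_confirmed=False).value)
```

The point of the ordering is that a flagged resource is only ever blocked when
both safety nets are missing: no confirmed stop and no stonith device.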

A comment to what Kevin Tomlinson said:
I would also be happy if a problem in the HA subsystem had no impact
on the availability of the services controlled by HA. Or better, if HA would heal
the appropriate subprocess on its own.


Mostly it does - with the exception of the "heartbeat" processes

And in most cases it even does so fast enough that the DC doesn't need to take any action (zero resource downtime).


Implementing suicide guarantees that there will be downtime.

But: As was said here in this thread, Heartbeat takes a role similar to "init".
I assume "init" to be bulletproof. HA must be bulletproof. Everything must be
done that makes HA work correctly. Don't spend development resources on a
self-healing mechanism for cases which shouldn't happen. If HAv2 goes crazy
then everything is lost... rub it out!  ;-))
Anyone who thinks that the likelihood or penalty of a service outage caused by
HA is higher than the possible service outage introduced by a hand-controlled cluster
should NOT use HA for high availability. And, by the way: There are/were guys on the
list who came to exactly this conclusion: Don't use complicated (early stage) HAv2
and do resource failover by hand, notified by network/system management tools.
Of course, only my opinion.

Best regards
Andreas Mock




_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


