Re: Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

matilda matilda Wed, 07 Nov 2007 00:16:44 -0800

>>> Alan Robertson <[EMAIL PROTECTED]> 07.11.2007 05:51 >>>
>> 
>> Actually there are plenty of people using it today.
>> I'd much prefer they had a real device, but they are aware of the risks 
>> and seem happy enough.
>
> Like lots of cases, people are happy enough until they get burned.  Just 
> like the people with shared storage and no STONITH who are happy enough 
> until they get bit.  And, there are plenty of those folks too - probably 
> quite a few more.


Hi Andrew, hi Alan,

I'm not playing in your "knowledge league", but on this point I have to agree
with Alan, based purely on my own experience with HAv2:

a) Someone who gets a somewhat more elaborate configuration running is very
happy that it works, meaning the services come up and a few manageable resource
failovers have been tested. A correct stonith configuration is much harder to
test, and I'm pretty sure the need to test stonith behaviour as well is
not appreciated as it should be (only my assumption).

My experience with HAv2 clusters overall is that they sometimes behave very
differently under heavy load. And under heavy load I'm pretty sure that an ssh
stonith device would NOT work reliably.


b) You have to follow this list for months to get a feeling for how important
a reliably working external stonith device is. This has to be emphasized more
than the documentation currently does (IMHO).
I had a long discussion explaining why we have to implement a stonith device
correctly. (By the way, some will remember that I started a thread
about external stonith plugins.)

Hence a suggestion:
- Emphasize the necessity of an external, properly configured stonith device.
Explain the reasons and scenarios that lead to this requirement. If the
administrators/implementers understand the potential risks, they will think
differently about it.

- RA scripts should be tagged so that a failover of a resource managed by such
an RA will NEVER happen if stonith is not configured.
Probably better explained: if (real) stonith is not configured, there is a risk
of damaging resources by starting them twice. This risk applies to certain
resources (RAs). When heartbeat does not know for sure whether the resource is
(properly) stopped on a node, it should NEVER fail over such a resource. Just
keep your fingers off it. In such a case you have to provide some kind of
notification mechanism (probably something other than writing one line among
thousands to a log file) so that an administrator checks by hand.
So certain RAs would have a flag (probably another attribute) saying not to
fail over the resource if stonith is not configured for a cluster using this RA.
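
To make the suggestion a bit more concrete, here is a minimal sketch of the
decision rule in Python. It is only an illustration of the idea, not heartbeat
code: the names Resource, requires_stonith, Action and decide are all
hypothetical, and the assumption that a configured stonith device makes the
failover safe (because the old node can be fenced first) is mine.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FAILOVER = "failover"
    HOLD_AND_NOTIFY = "hold_and_notify"  # do nothing, alert an administrator


@dataclass
class Resource:
    name: str
    # Hypothetical RA flag: starting this resource twice can damage it
    # (e.g. a filesystem on shared storage).
    requires_stonith: bool


def decide(resource: Resource, stonith_configured: bool, stop_confirmed: bool) -> Action:
    """Decide what to do with a resource whose node looks failed."""
    if stop_confirmed:
        # The resource is known to be stopped, so starting it elsewhere is safe.
        return Action.FAILOVER
    if resource.requires_stonith and not stonith_configured:
        # State on the old node is unknown and the node cannot be fenced:
        # never start a second copy, ask a human to check instead.
        return Action.HOLD_AND_NOTIFY
    # Assumption: either the node can be fenced first, or the resource
    # tolerates running twice.
    return Action.FAILOVER
```

With this rule, a flagged resource on a cluster without stonith stays where it
is until someone has confirmed that it is really stopped; unflagged resources
behave as they do today.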

What do you think? Is this idea totally braindead?


A comment on what Kevin Tomlinson said:
I would also be happy if a problem in the HA subsystem had no impact
on the availability of the services controlled by HA, or better, if HA healed
the affected subprocess on its own.
But, as was said here in this thread, heartbeat takes a role similar to "init".
I assume "init" to be bullet proof. HA must be bullet proof. Everything that
makes HA work correctly must be done. Don't spend development resources on a
self-healing mechanism for cases which shouldn't happen. If HAv2 goes crazy
then everything is lost... rub it out!  ;-))

Anyone who thinks that the likelihood or penalty of a service outage caused by
HA is higher than that of an outage in a hand-controlled cluster
should NOT use HA for high availability. And, by the way: there are/were guys
on the list who came to exactly this conclusion: don't use the complicated
(early stage) HAv2, and do resource failover by hand when notified by
network/system management tools.

Of course, only my opinion.

Best regards
Andreas Mock




_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
