Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Yan Fitterer Wed, 07 Nov 2007 08:16:22 -0800


Andrew Beekhof wrote:
> 
> On Nov 7, 2007, at 9:16 AM, matilda matilda wrote:
> 
>>>>> Alan Robertson <[EMAIL PROTECTED]> 07.11.2007 05:51 >>>
>>>>
>>>> Actually there are plenty of people using it today.
>>>> I'd much prefer they had a real device, but they are aware of the risks
>>>> and seem happy enough.
>>>
>>> Like lots of cases, people are happy enough until they get burned.  Just
>>> like the people with shared storage and no STONITH who are happy enough
>>> until they get bit.  And, there are plenty of those folks too - probably
>>> quite a few more.
>>
>> Hi Andrew, hi Alan,
>>
>> I'm not playing in your "knowledge league", but at this point I have
>> to agree with Alan, and only because of my own experience with HAv2:
>>
>> a) Someone who gets a slightly more elaborate configuration running is
>> very happy that it works, meaning the services come up and some
>> manageable resource failovers have been tested. A correct stonith
>> configuration is hard to test, and I'm pretty sure the need to also
>> test stonith behaviour is not appreciated properly (only my
>> assumption).
>>
>> My experience with HAv2 clusters is that they sometimes behave very
>> differently under heavy load. And under heavy load I'm pretty sure
>> that an ssh stonith device would NOT work reliably.
> 
> I'm not arguing that ssh will work in 100% of cases - but what it will
> do is make the cluster wait until it does.
>
> What I am also trying to point out is that if the node is in such a
> state that ssh won't work reliably, then what are the chances that the
> node will be able to commit suicide a) at all, b) before the cluster has
> started the services for the second time?
> 
> I'd argue that it is exactly these situations where ssh is _better_ than
> suicide.


My 2c... Although my experience is rather limited, I have encountered
one real-life situation where ssh would not have worked: a split brain
created by putting the firewall in "closed" mode, i.e. all inbound IP
packets rejected by iptables, but outbound packets allowed. So the
cases where ssh is unsuitable are not that unusual.

I must say I concur with Alan: in real life, SSH is far too fragile to
be a reliable STONITH method.

The ability to suicide would certainly be good to have in addition to
STONITH, IMHO.
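
To make the firewall case concrete, here is a rough sketch (a toy model,
not anything taken from heartbeat itself). All names in it (Node,
ssh_fence, power_fence, may_take_over) are made up for illustration, and
it assumes that an out-of-band device such as a power switch does not
depend on the target's network state, which is a simplification.

```python
from dataclasses import dataclass
from enum import Enum


class FenceResult(Enum):
    CONFIRMED = "confirmed"      # peer is known to be down
    UNREACHABLE = "unreachable"  # fencing request never got through


@dataclass
class Node:
    name: str
    accepts_inbound: bool  # False models the "closed" firewall case


def ssh_fence(target: Node) -> FenceResult:
    """Hypothetical ssh-style fencing: the target must accept a connection."""
    if target.accepts_inbound:
        return FenceResult.CONFIRMED
    return FenceResult.UNREACHABLE


def power_fence(target: Node) -> FenceResult:
    """Hypothetical out-of-band fencing (assumed independent of the target's network)."""
    return FenceResult.CONFIRMED


def may_take_over(result: FenceResult) -> bool:
    """Resources may be started elsewhere only once the peer is confirmed down."""
    return result is FenceResult.CONFIRMED


firewalled = Node("node2", accepts_inbound=False)
print(may_take_over(ssh_fence(firewalled)))    # False
print(may_take_over(power_fence(firewalled)))  # True
```

With ssh as the only method, the surviving node in this model can never
confirm the fence, so it either waits forever or takes over unsafely.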


> 
>> b) You have to follow this list for months to get a feeling for how
>> important a reliably working external stonith device is. This has to
>> be emphasized more than the documentation currently does (IMHO).
>> I had a long discussion explaining why we have to implement a stonith
>> device correctly. (By the way, some will remember that I started a
>> thread about external stonith plugins.)
>>
>> Hence a suggestion:
>> - Emphasize the necessity of an external, properly configured stonith
>> device.
> 
> wiki.linux-ha.org :-)
> 
>>
>> Explain the reasons and scenarios that give rise to this requirement.
>> If the administrators/implementers understand the potential risks,
>> they will think differently about it.
>>
>> - RA scripts should be tagged so that a failover of a resource managed
>> by this RA will NEVER happen if stonith is not configured.
> 
> I'm pretty sure you get this by setting on_fail=block.
> And actually, if a stop fails and stonith is not enabled, you end up
> blocking anyway.
>
> But that's a different scenario from fencing in response to a node-level
> failure.
> 
>>
>> Probably better explained: if (real) stonith is not configured, there
>> is a risk of damaging resources by starting them twice. This risk
>> applies to certain resources (RAs). In a case where heartbeat does not
>> know for sure whether the resource is (properly) stopped on a node, it
>> should NEVER fail over such a resource. Just keep your hands off it.
>> In such a case you have to provide some kind of notification mechanism
>> (probably something other than writing one line among thousands to a
>> log file) to get an administrator to check by hand.
>> So certain RAs would have a flag (probably another attribute) saying
>> not to fail over this resource if stonith is not configured for a
>> cluster using this RA.
>>
>> What do you think? Is this idea totally braindead?
>>
>> A comment on what Kevin Tomlinson said:
>> I would also be happy if a problem in the HA subsystem had no impact
>> on the availability of the services controlled by HA. Or better, if HA
>> healed the affected subprocess on its own.
> 
> Mostly it does - with the exception of the "heartbeat" processes.
> And in most cases it even does so fast enough that the DC doesn't need
> to take any action (zero resource downtime).
> 
> Implementing suicide guarantees that there will be downtime.
> 
>>
>> But: as has been said in this thread, Heartbeat takes a role similar
>> to "init". I assume "init" to be bulletproof. HA must be bulletproof.
>> Everything that makes HA work correctly must be done. Don't spend
>> development resources on a self-healing mechanism for cases which
>> shouldn't happen. If HAv2 goes crazy then everything is lost... rub it
>> out!  ;-))
>>
>> Anyone who thinks that the likelihood or penalty of a service outage
>> caused by HA is higher than that of an outage in a hand-controlled
>> cluster should NOT use HA for high availability. And, by the way:
>> there are/were guys on the list who came to exactly this conclusion:
>> don't use complicated (early stage) HAv2, and do resource failover by
>> hand when notified by network/system management tools.
>>
>> Of course, only my opinion.
>>
>> Best regards
>> Andreas Mock
>>
>>
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
> 