Andrew Beekhof wrote:
On Nov 7, 2007, at 9:16 AM, matilda matilda wrote:
Alan Robertson <[EMAIL PROTECTED]> wrote on 07.11.2007 05:51:
Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the risks
and seem happy enough.
Like lots of cases, people are happy enough until they get burned. Just
like the people with shared storage and no STONITH who are happy enough
until they get bit. And, there are plenty of those folks too - probably
quite a few more.
Hi Andrew, hi Alan,
I'm not playing in your "knowledge league", but at this point I have to
agree with Alan, and this purely from my own experiences with HAv2:
a) Someone who gets a slightly more elaborate configuration to run is
very, very happy that it works, meaning the services come up and some
manageable resource failovers are tested. A correct stonith configuration,
by contrast, can hardly be tested at all. And I'm pretty sure that the
necessity of also testing stonith behaviour is not given appropriate
weight (only my assumption).
My experience with the whole HAv2 cluster is that it sometimes behaves
very differently under heavy load. And under heavy load I'm pretty sure
that an ssh stonith device would NOT work reliably.
I'm not arguing that ssh will work in 100% of cases - but what it will
do is make the cluster wait until it does.
What I am also trying to point out is that if the node is in such a
state that ssh won't work reliably, then what are the chances that the
node will be able to commit suicide a) at all, and b) before the cluster
has started the services for a second time?
I'd argue that it is exactly these situations where ssh is _better_ than
suicide.
b) You have to follow this list for months to get a feeling for how
important a reliably working external stonith device is. This has to be
emphasized more than it currently is in the documentation (IMHO).
I had a long discussion explaining why we have to implement a stonith
device correctly. (By the way, some will remember the thread I initiated
about external stonith plugins.)
Hence a suggestion:
- Emphasize the necessity of a properly configured external stonith
device.
wiki.linux-ha.org :-)
Explain the reasons and scenarios that give rise to this requirement. If
the administrators/implementers understand the potential risks, they will
think differently about it.
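To make that concrete in the documentation, one could show what a "real"
stonith device looks like in the HAv2 CIB, as opposed to the ssh plugin.
The sketch below is only an illustration: it assumes the external/ipmi
plugin, the resource id, node name, and parameter values are invented,
and the exact parameter names should be checked against the plugin's own
documentation before use.

```xml
<!-- sketch only: one stonith primitive per node to be fenced,
     assuming the external/ipmi plugin; all values are invented -->
<primitive id="st-node1" class="stonith" type="external/ipmi">
  <instance_attributes id="st-node1-ia">
    <attributes>
      <nvpair id="st-node1-host" name="hostname" value="node1"/>
      <nvpair id="st-node1-ip"   name="ipaddr"   value="10.0.0.101"/>
      <nvpair id="st-node1-user" name="userid"   value="admin"/>
    </attributes>
  </instance_attributes>
</primitive>
```

Together, of course, with stonith actually being enabled in the cluster
options, so the CRM will use the device rather than just hosting it.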
- RA scripts should be tagged so that a failover of a resource managed
by such an RA will NEVER happen if stonith is not configured.
I'm pretty sure you get this by setting on_fail=block.
And actually, if a stop fails and stonith is not enabled, you end up
blocking anyway.
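(For the record, on_fail is set per operation in the CIB. A hedged
sketch - the resource name and timeout are invented, so check the CRM
DTD for the exact attribute set:)

```xml
<!-- sketch: if the stop operation fails, block instead of recovering,
     rather than risk a second active copy of the resource -->
<primitive id="my-fs" class="ocf" provider="heartbeat" type="Filesystem">
  <operations>
    <op id="my-fs-stop" name="stop" timeout="60s" on_fail="block"/>
  </operations>
</primitive>
```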
But that's a different scenario from fencing in response to a node-level
failure.
Probably better explained: if (real) stonith is not configured, there is
a risk of damaging resources by starting them twice. This risk applies to
certain resources (RAs). In a case where heartbeat does not know for sure
whether a resource has been (properly) stopped on a node, it should NEVER
fail over such a resource - just keep its fingers off it. In such a case
you have to provide some kind of notification mechanism (probably
something other than writing one line among thousands to a log file) to
get an administrator to check by hand.
So, certain RAs would have a flag (probably another attribute) saying not
to fail over this resource if stonith is not configured for the cluster
using this RA.
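Purely as an illustration of the idea - nothing like this exists today,
and the attribute name below is invented:

```xml
<!-- hypothetical: refuse to fail over this resource unless a stonith
     device is configured for the cluster; attribute name is invented -->
<primitive id="shared-fs" class="ocf" provider="heartbeat" type="Filesystem">
  <instance_attributes id="shared-fs-ia">
    <attributes>
      <nvpair id="shared-fs-req" name="require_stonith_for_failover"
              value="true"/>
    </attributes>
  </instance_attributes>
</primitive>
```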
What do you think? Is this idea totally braindead?
A comment on what Kevin Tomlinson said:
I would also be happy if a problem in the HA subsystem had no impact on
the availability of the services controlled by HA - or better, if HA
would heal the affected subprocess on its own.
Mostly it does - with the exception of the "heartbeat" processes
And in most cases it even does so fast enough that the DC doesn't need
to take any action (zero resource downtime).
Implementing suicide guarantees that there will be downtime.
Since you're relying on STONITH for recovery, the same thing is true.
There is no difference.
If you want to recover without downtime, then as soon as you get the
stonith-free recovery working for a particular process, disable the
automatic reboot for that process. Then we won't reboot, you won't
stonith, and everyone will be happy.
And, at the moment, the code only reboots on death-by-signal - and I'm
thinking about whether that is the right choice.
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems