On 2009-05-06T11:45:36, [email protected] wrote:
> Thanks to Lars and Dominik for your help. I have read up in the SLE-11 HA PDF
> (an excellent document) and I understand a lot more of it now.
>
> The crm shell is awesome. I had at first discounted it because I was used to
> using the cibadmin tool. But now that I can see its power, I don't use
> anything else.
>
> My cluster is working exactly as I want now.
Great! That's good news.
> I'm still not 100% sold on the value of STONITH or fencing in the real
> world but I'm off to do more reading up on it.
Well, one of the values is that if you call Novell support, they'll
actually listen to you instead of saying "No STONITH? Gee, that's too
bad, come back with a supported configuration" ;-)
Joking (even if true) aside:
In case a resource fails to stop, during a migration for example, the
cluster is blocked and cannot continue. If you have STONITH configured,
this is cleared up by fencing the node: a fenced node is known to have
the resource stopped, so the cluster can continue.
In theory, resource agents aren't ever allowed to fail the 'stop' op.
But it can happen if the service is truly broken, the RA has a bug, the
kernel is confused, the disk has gone haywire and blocks, ... This
error scenario cannot be recovered in software, and if you don't have
STONITH, it can bring down your cluster.
Further, in the case of a node failure, STONITH is used to ensure the
failed nodes/minority partition is truly dead before starting services
within the quorate partition.
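
To make the two cases above concrete, here is a toy model in Python. It is
only a sketch of the reasoning, not Pacemaker code: the names Node,
handle_stop_failure and safe_to_start are made up for illustration, and
fencing is reduced to setting a flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    fenced: bool = False  # True once STONITH has confirmed the node is dead


def handle_stop_failure(node: Node, stonith_enabled: bool) -> str:
    """Decide what happens after a 'stop' op has failed on `node`."""
    if stonith_enabled:
        # Fencing the node implies the resource is stopped there.
        node.fenced = True
        return "recover-elsewhere"
    # Nothing proves the resource is stopped, so it cannot be started
    # anywhere else without risking two active copies.
    return "blocked"


def safe_to_start(partition_size: int, cluster_size: int, lost_nodes: list) -> bool:
    """Services may start only in the quorate partition, and only once
    every node outside of it is confirmed dead."""
    has_quorum = partition_size * 2 > cluster_size
    return has_quorum and all(n.fenced for n in lost_nodes)


if __name__ == "__main__":
    stuck = Node("node2")
    print(handle_stop_failure(stuck, stonith_enabled=False))
    print(handle_stop_failure(stuck, stonith_enabled=True))
    print(safe_to_start(2, 3, [stuck]))
```

Without STONITH the first call has no safe way forward; with it, the same
failure turns into an ordinary recovery on another node.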
You may say "But I'm using drbd so what do I care, it'll just resync",
and that would be mostly true, but:
1. In theory, the replication link could still be up even though
OpenAIS/Pacemaker think the node is dead. This could cause the dreaded
dual access to the same shared storage.
2. The dying nodes might still hold a connection to client processes.
They might continue writing locally, and _confirming the writes to the
clients_. These would be overwritten on resync. While the image would be
consistent afterwards, you have just lost transactional integrity, which
is generally considered a bad thing.
3. Somewhat obscure: the dying node might continue writing locally
(consider a haywire process), which increases the amount of data in need
of resyncing.
4. STONITH'ing the failed node might allow it to recover from transient
errors, bring up the replication again, and reduce the time in degraded
mode w/o redundancy.
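
Point 2 is the one people tend to underestimate, so here is a small
simulation of it. Again this is only a sketch under simplifying
assumptions: each replica is a plain Python list of transaction names, the
function simulate and the names tx1..tx4 are invented for the example, and
resync simply copies the surviving side over the old one.

```python
def simulate(fence_old_primary: bool) -> list:
    """Return the writes that were confirmed to clients but lost on resync."""
    old_primary = ["tx1", "tx2"]
    new_primary = list(old_primary)  # the peer was in sync before the failure
    acked = list(old_primary)        # everything confirmed to clients so far

    # The cluster declares the old primary dead and promotes the peer.
    # Only fencing guarantees that the old primary has really stopped.
    old_primary_running = not fence_old_primary

    # A client still connected to the old primary sends one more write,
    # which is stored locally and confirmed.
    if old_primary_running:
        old_primary.append("tx3")
        acked.append("tx3")

    # Meanwhile the new primary serves its own clients.
    new_primary.append("tx4")
    acked.append("tx4")

    # Resync: the old node's data is overwritten with the new primary's.
    old_primary = list(new_primary)

    return [tx for tx in acked if tx not in old_primary]


if __name__ == "__main__":
    print("without fencing, lost:", simulate(fence_old_primary=False))
    print("with fencing, lost:", simulate(fence_old_primary=True))
```

Without fencing, tx3 was confirmed to a client and is gone after the
resync; with fencing, the old primary never gets the chance to confirm it.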
So in general, fencing/STONITH is a really really good idea.
Regards,
Lars
--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde