On 2013-07-01T16:31:13, William Seligman <[email protected]> wrote:

> a) people can exclaim "You fool!" and point out all the stupid things I did 
> wrong;
> 
> b) sysadmins who are contemplating the switch to HA have additional points to
> add to the pros and cons.

I think you bring up an important point that I also try to stress when I
talk to customers: HA is not for everyone, since it's not a magic
bullet. HA environments can protect against hardware faults (and some
operator issues and failing software), but they need to be carefully
managed and designed. They don't come for free, and the complexity can
be a deterrent.

(While the "Complex, but not complicated" is a good goal, it's not
easily achieved.)

And that is why recovery plans should always include a Plan C: "How
do I bring my services online manually, without a cluster stack?"
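A minimal "Plan C" sketch for a DRBD-backed setup might look like the
following. Every name here (r0, vg_cluster, lv_data, smbd) is a
placeholder for your own resources, and the forced-primary step is
only safe once you have verified the peer really is down:

```sh
# Plan C: bring services up by hand on one node, cluster stack stopped.
service pacemaker stop; service cman stop   # get the stack out of the way
drbdadm up r0
drbdadm primary r0      # add "-- --force" ONLY if the peer is confirmed dead
# Clustered VGs may refuse to activate without clvmd; overriding the
# cluster locking is a deliberate, manual decision:
vgchange -ay vg_cluster
mount /dev/vg_cluster/lv_data /srv/data
service smbd start      # ...and whatever else your users need
```

Having this written down and rehearsed is half the point: in an outage
you want to execute a checklist, not design one.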

> I'll mention this one first because it's the most recent, and it was the straw
> that broke the camel's back as far as the users were concerned.
> 
> Last week, cman crashed, and the cluster stopped working. There was no clear
> message in the logs indicating why. I had no time for archeology, since the
> crash happened in middle of our working day; I rebooted everything and cman
> started up again just fine.

Such stuff happens; but without archaeology, we're unlikely to be able
to fix it ;-) I take it, though, that you're not running the latest
supported versions and don't have a good support contract; that really
is important.

And, of course, that is why we strive to produce support tools that
allow first-failure data capture - so we can get a full overview of
the log files and of what triggered whatever problem the system
encountered, without needing to reproduce it. (crm_report/hb_report,
qb_blackbox, etc.)
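For instance, right after an incident (timestamps and destination are
illustrative):

```sh
# Collect logs, CIB, and node state from all cluster nodes around the
# time of the crash, for later analysis or a support ticket:
crm_report -f "2013-07-01 12:00" -t "2013-07-01 14:00" /tmp/outage-0701
# (on older heartbeat-based stacks, hb_report takes similar options)
```

That turns "I had no time for archaeology" into "the archaeology is in
a tarball for later".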

> Problems under heavy server load:
> 
> Let's call the two nodes on the cluster A and B. Node A starts running a 
> process
> that does heavy disk writes to the shared DRBD volume. The load on A starts to
> rise. The load on B rises too, more slowly, because the same blocks must be
> written to node B's disk.
> 
> Eventually the load on A grows so great that cman+clvmd+pacemaker does not
> respond promptly, and node B stoniths node A. The problem is that the DRBD
> partition on node B marked "Inconsistent". All the other resources in the
> pacemaker configuration depend on DRBD, so none of them are allowed to run.

This shouldn't happen. The cluster stack is supposed to be isolated
from the "mere" workload via realtime scheduling, IO priority, and
locking itself into memory. Either that isolation failed, or your
timeouts for the monitor operations were too short.

(I have noticed a recent push to drop the SCHED_RR priority from
these processes, because that seemingly makes some problems go away.
But personally, I think that just masks a priority inversion somewhere
in the messaging layers and isn't a proper fix; instead, it exposes us
to more situations like the one described here.)
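You can check on a node whether the stack actually holds realtime
priority and locked memory; "corosync" below stands in for whatever
your membership layer's daemon is called on your stack:

```sh
# Inspect scheduling policy and locked memory of the cluster daemons:
for d in corosync pacemakerd; do
    pid=$(pidof "$d") || continue
    chrt -p "$pid"                   # scheduling policy and RT priority
    grep VmLck "/proc/$pid/status"   # mlockall'd memory, in kB
done
```

If `chrt` reports SCHED_OTHER or VmLck is 0, the daemons are competing
with your workload on equal terms.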

But even a missed monitor shouldn't lead to a stonith directly; it
should lead to the resources being stopped. Only if that stop fails
would it then escalate to a fence.

And - a fence also shouldn't make DRBD inconsistent like this. Is your
DRBD set up properly?

> "Poisoned resource"
> 
> This is the one you can directly attribute to my stupidity.
> 
> I add a new resource to the pacemaker configuration. Even though the pacemaker
> configuration is syntactically correct, and even though I think I've tested 
> it,
> in fact the resource cannot run on either node.
> 
> The most recent example: I created a new virtual domain and tested it. It 
> worked
> fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm
> could parse it, and activated the configuration. However, I had not actually
> created the domain for the virtual machine; I had typed "virsh create ..." but
> not "virsh define ...".
> 
> So I had a resource that could not run. What I'd want to happen is for the
> "poisoned" resource to fail, I see lots of error messages, but the remaining
> resources would continue to run.
> 
> What actually happens is that resource tries to run on both nodes alternately 
> an
> "infinite" number of times (10000 times or whatever the value is). Then one of
> the nodes stoniths the other. The poisoned resource still won't run on the
> remaining node, so that node tries restarting all the other resources in the
> pacemaker configuration. That still won't work.

Yeah, so, you describe a real problem here.

"Poisoned resources" indeed should just fail to start and that should be
that. What instead can happen is that the resource agent notices it
can't start, reports back to the cluster, and the cluster manager goes
"Oh no, I couldn't start the resource successfully! It's now possibly in
a weird state and I better stop it!"

... And because of the misconfiguration, the *stop* also fails, and
you're hit with the full power of node-level recovery.

I think this is an issue with some resource agents (if the parameters
are so bad that the resource couldn't possibly have started, why fail
the stop?) and possibly also something where one could contemplate a
better on-fail="" default for "stop in response to first-start failure".
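Until the defaults improve, this can be mitigated in the
configuration. A hedged sketch in crm shell syntax - names and values
are illustrative, and on-fail="block" is a trade-off: it freezes the
failed resource for the admin to inspect instead of escalating, but
also disables further automatic recovery for it:

```
# Cap the start-failure ping-pong and keep a failed *stop* from
# escalating straight to a fence:
primitive p_vm ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm1.xml" \
    op start timeout="120s" \
    op stop timeout="120s" on-fail="block" \
    meta migration-threshold="3" failure-timeout="10min"
```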

I've seen production clusters explode with this, and it's not funny.
:-/

(And "poisoned resource" is a nice term for this problem, thanks for
that.)


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
