>>> "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)"
<[email protected]> schrieb am 16.08.2012 um 17:54 in
Nachricht
<[email protected]>:

[...]
> From my experience with SLES11 SP2 (with all current updates) I conclude 
> that actually nobody is seriously running SP2 without local bugfixes.

Unfortunately that's true for SP1 as well: we had to use a newer corosync 
(among other packages).

> 
> E.g. even the simplest examples from the official SuSE documentation 
> don't work as expected.
> 
> A trivial example: ocf:heartbeat:exportfs as distributed by SuSE with SP2 
> causes unlimited growth of .rmtab files (which quickly grows into the 
> gigabytes for serious NFS servers). I could work around this issue using 
> some shell scripting.

Yes, we had that too, for SP1. Fixed in 
"resource-agents-3.9.2-0.4.2.1.4061.0.PTF.754067" (just for reference). 
Unfortunately the problem only shows up when seriously using the NFS server.
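
The shell workaround mentioned above was not posted, but it might look roughly 
like this sketch: assuming the agent's rmtab backup simply accumulates 
duplicate lines, periodically collapse them (the function name and the 
/var/lib/nfs/rmtab path below are assumptions, not the original script):

```shell
#!/bin/sh
# Hypothetical workaround sketch: the exportfs agent's rmtab backup keeps
# appending entries, so periodically collapse duplicate lines in place.
dedupe_rmtab() {
    rmtab="$1"
    # Deduplicate into a temp file first, then replace atomically so
    # readers never see a half-written file.
    tmp=$(mktemp "${rmtab}.XXXXXX") || return 1
    sort -u "$rmtab" > "$tmp" && mv "$tmp" "$rmtab"
}
```

One would run this from cron against wherever the agent keeps its backup, 
e.g. `dedupe_rmtab /var/lib/nfs/rmtab` (path assumed; check your system).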

> 
> There are other issues which are more than annoying and actually make the 
> SLES SP2 HA Extension unusable for production systems. E.g. clvmd cannot be 
> made less verbose from the cluster configuration. (No, daemon_options="-d0" 
> does not help!)

I haven't tried it, but it's on the agenda.

> 
> Not funny is also the fact that the official SLES 11 SP2 kernels crash 
> seriously (when a node rejoins the cluster) when using SCTP as recommended in 
> the SLES HA documentation and offered via the wizards. It took me a while to 
> find out what was going on.

No, we did not have these bugs, but we had a crashing crmd, and a two-node 
cluster that could not agree on who is DC for several minutes.

> 
> When setting up a system with many (rather simple) resources funny things 
> happen due to race conditions all over the place. (This can be worked around 
> mostly using arbitrary start-delay options.)
> 
> Oh, did I mention that situations which are actually forbidden by 
> constraints (e.g. using a score of INFINITY) actually do happen... Depending 
> on the environment this can lead to not so funny effects.
> 
> E.g. I defined the following constraints:
> 
> colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> order o34 inf: p_fs_daten p_lsb_ccslogserver:start
> 
> I can prove from the logs that ccslogserver (an application) got migrated 
> from node A to node B while p_fs_daten (a filesystem on top of drbd) was 
> definitely still running on node A.

I'm absolutely no expert on that, but I think your constraints will allow 
p_fs_daten to be active on one node while p_lsb_ccslogserver is going down 
(being migrated). Only before starting p_lsb_ccslogserver must p_fs_daten be 
up; probably the colocation is ignored during the transition.
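
If the intent is that the filesystem and the log server always move together, 
in both directions, a group might express that more robustly than the two 
separate constraints. A sketch in crm shell syntax, reusing the resource names 
from above (the group name g_ccslog is invented):

```
# A group implies colocation plus ordering in both directions:
# p_fs_daten starts first, stops last, always on the same node
# as p_lsb_ccslogserver.
group g_ccslog p_fs_daten p_lsb_ccslogserver
```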

I'm also unsure whether transitive ordering and colocation work.

What also disappointed me: when adding stickiness to a primitive, a group gets 
more or less the sum of its primitives' stickiness, but when you add stickiness 
to a group, EVERY primitive gets that stickiness, and the group STILL gets the 
sum of all of these. Especially bad: when you add one more primitive to a 
group, the total stickiness changes.
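
A sketch of that arithmetic in crm shell syntax (resource names, the Dummy 
agent, and the values are made up purely to illustrate the behavior described 
above):

```
primitive p_a ocf:pacemaker:Dummy meta resource-stickiness=100
primitive p_b ocf:pacemaker:Dummy meta resource-stickiness=100
group g_ab p_a p_b
# Effective group stickiness: 100 + 100 = 200.
#
# Setting stickiness on the group instead:
#   group g_ab p_a p_b meta resource-stickiness=100
# gives every member stickiness 100, and the group again scores the
# sum (200) -- so adding a third member silently changes the total.
```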

Likewise, if you use resource utilization on primitives in a group, the group 
begins to start on one node, then stalls when the next primitive's utilization 
cannot be fulfilled. That's especially bad when there are enough resources for 
the whole group on another node. (Here utilizations are not summed.)
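
In crm shell syntax, the situation described above might look like this 
(node names, resource names, and numbers are invented for illustration):

```
node node1 utilization memory=1024
node node2 utilization memory=4096
primitive p_big1 ocf:pacemaker:Dummy utilization memory=1024
primitive p_big2 ocf:pacemaker:Dummy utilization memory=1024
group g_big p_big1 p_big2
# Since the members' utilizations are not summed for placement,
# p_big1 may start on node1; p_big2 then stalls there, even though
# node2 could host the whole group.
```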

Some concepts were implemented very "ad hoc".

And one of the popular cluster books describes the XML configuration. It's 
like describing how to start the engine of your car: open the hood, locate the 
battery and the starter motor. Then take a pair of wires, connect one end to 
the battery and the other end to the starter motor, watching for the right 
polarity. Then... (you get the idea)

The best tool around is the crm shell (IMHO), while the GUI has extraordinarily 
poor performance once your cluster has a reasonable number of resources.

There is an access control concept (ACLs) based on XPath. Unfortunately, 
implementing proven access restrictions would require describing the data 
model of the CIB exactly. It's a bit complicated...
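
For reference, an ACL entry in the 1.1-era CIB XML looks roughly like this; 
the user name and XPath below are invented, and getting the XPath right 
requires knowing the CIB schema, which is exactly the complication:

```
<acls>
  <acl_user id="monitoring">
    <!-- Read-only access to resource status; anything not matched
         is denied by default. -->
    <read xpath="/cib/status"/>
  </acl_user>
</acls>
```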

> 
> Reporting bugs is not possible without a direct support contract. (You must 
> enter into a support contract with SuSE before you can even report a bug or 
> provide a patch ....)

Yes: I found out that there is no mechanism to repair non-clustered MD-RAIDs, 
so I wrote a RAID monitor and proposed it to support. I still haven't heard 
any feedback about it...

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
