On 12.08.2010 04:12, Cnut Jansen wrote:
> Basically I have a cluster of 2 nodes with cloned DLM, O2CB, DRBD and
> mount resources, and a MySQL resource (grouped with an IPaddr
> resource) running on top of the other ones.
> The MySQL (group) resource depends on the mount resource, which in
> turn depends equally on both the DRBD and the O2CB resources, and the
> O2CB resource depends on the DLM resource:
>
> cloneDlm -> cloneO2cb -\
>                         }-> cloneMountMysql -> mysql / grpMysql( mysql -> ipMysql )
> msDrbdMysql -----------/
>
> Furthermore, for the MySQL (group) resource I set the meta-attributes
> "migration-threshold=1" and "failure-timeout=90" (later I also tried
> the values "3" and "130" for these).
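For anyone who wants to reproduce this: the layout above corresponds
roughly to a crm configuration like the following sketch. Only the
resource names and the dependency structure are taken from my setup;
the primitive definitions are placeholders and the constraint IDs are
made up:

  # primitives dlm, o2cb, drbdMysql, mountMysql, mysql and ipMysql
  # omitted here / placeholders
  clone cloneDlm dlm
  clone cloneO2cb o2cb
  ms msDrbdMysql drbdMysql \
          meta master-max="1" notify="true"
  clone cloneMountMysql mountMysql
  group grpMysql mysql ipMysql \
          meta migration-threshold="1" failure-timeout="90"
  # O2CB depends on DLM
  order ordDlmO2cb inf: cloneDlm cloneO2cb
  colocation colO2cbDlm inf: cloneO2cb cloneDlm
  # the mount depends equally on O2CB and on the promoted DRBD
  order ordO2cbMount inf: cloneO2cb cloneMountMysql
  colocation colMountO2cb inf: cloneMountMysql cloneO2cb
  order ordDrbdMount inf: msDrbdMysql:promote cloneMountMysql:start
  colocation colMountDrbd inf: cloneMountMysql msDrbdMysql:Master
  # the MySQL group depends on the mount
  order ordMountMysql inf: cloneMountMysql grpMysql
  colocation colMysqlMount inf: grpMysql cloneMountMysql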
> Now through a lot of testing I found out that:
> a) the stops/restarts of the underlying resources happen only when the
> fail-counter hits the limit set by migration-threshold; i.e. when it
> is set to 3, on the first 2 failures only mysql/grpMysql is restarted
> on the same node, and only on the 3rd one are the underlying resources
> left in a mess (while mysql/grpMysql migrates) (reproducible for DRBD;
> unsure about the DLM/O2CB side, but there is sometimes serious trouble
> there too after having picked on mysql; I just couldn't definitively
> link it yet)
> b) upon causing mysql/grpMysql's migration, the score for
> msDrbdMysql:promote changes from 10020 to -inf and stays there for the
> duration of mysql/grpMysql's failure-timeout (verified by also setting
> it to 130), before it rises back up to 10000
> c) msDrbdMysql remains slave until the next cluster-recheck after its
> promote-score has gone back up to 10000
> d) I also have the impression that fail-counters don't get reset after
> their failure-timeout, because with migration-threshold=3 set, those
> issues occur upon every(!) subsequent picking-on, even when I've
> waited for nearly 5 minutes (with failure-timeout=90) without touching
> the cluster at all
>
> I experienced this on both test clusters, a SLES 11 HAE SP1 with
> Pacemaker 1.1.2 and a Debian Squeeze with Pacemaker 1.0.9. When
> migration-threshold is removed from mysql/grpMysql, everything is fine
> (except that there is no migration, of course). I can't remember
> anything like this happening with SLES 11 HAE SP0's Pacemaker 1.0.6.
>
> p.s.: Just for fun / testing / proving, I also constrained
> grpLdirector to cloneMountShared... and could perfectly reproduce the
> problem with the resources underlying it, too.

For reference:

SLES11-HAE-SP1: The issues seem to be solved with the latest officially
released packages (upgraded yesterday directly from Novell's
repositories), including Pacemaker version 1.1.2-0.6.1 (arch x86_64),
shown in crm_mon as "1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b".
At least so far I couldn't reproduce any unnecessary restart of the
underlying resources (nor anything else touching them at all), and
fail-counters now do get reset - after their failure-timeout is over -
upon the next cluster-recheck (event- or interval-driven).

Debian Squeeze: Not tested again yet.
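A side note for anyone retracing b) to d): these are the standard
commands I'd use to watch fail-counts and the recheck behaviour (plain
Pacemaker / crm-shell tools; "mysql" is just the resource name from my
setup, and the shortened recheck interval is only a suggestion):

  crm_mon -1 -f                  # one-shot status including fail-counts
  crm_failcount -G -r mysql      # query the current fail-count of mysql
  crm resource cleanup mysql     # clear fail-count and failed ops by hand
  # an expired failure-timeout is only acted upon at a cluster-recheck,
  # so a shorter recheck interval makes the reset show up sooner:
  crm configure property cluster-recheck-interval="2min"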