Hi,

On Sat, Aug 14, 2010 at 06:26:58AM +0200, Cnut Jansen wrote:
> Hi,
>
> and first of all thanks for answering so far.
>
> On 12.08.2010 18:46, Dejan Muhamedagic wrote:
> > The migration-threshold shouldn't in any way influence resources
> > which don't depend on the resource which fails over. Couldn't
> > reproduce it here with our example RAs.
> Well, just to clearly establish that something is wrong there -
> whatever it is, simple misconfiguration or a possible bug - I now did
> "crm configure erase", completely restarted both nodes, and then set
> up this new, very simple, Dummy-based configuration:
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> node alpha \
>         attributes standby="off"
> node beta \
>         attributes standby="off"
> primitive dlm ocf:heartbeat:Dummy
> primitive drbd ocf:heartbeat:Dummy
> primitive mount ocf:heartbeat:Dummy
> primitive mysql ocf:heartbeat:Dummy \
>         meta migration-threshold="3" failure-timeout="40"
> primitive o2cb ocf:heartbeat:Dummy
> location cli-prefer-mount mount \
>         rule $id="cli-prefer-rule-mount" inf: #uname eq alpha
> colocation colocMysql inf: mysql mount
> order orderMysql inf: mount mysql
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-unknown" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         cluster-recheck-interval="150" \
>         last-lrm-refresh="1281751924"
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ...and then, picking on the resource "mysql", got this:
>
> 1) alpha: FC(mysql)=0, crm_resource -F -r mysql -H alpha
> Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_asyncmon_0 (call=48, rc=1, cib-update=563, confirmed=false) unknown error
> Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_stop_0 (call=49, rc=0, cib-update=565, confirmed=true) ok
> Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_start_0 (call=50, rc=0, cib-update=567, confirmed=true) ok
>
> 2) alpha: FC(mysql)=1, crm_resource -F -r mysql -H alpha
> Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_asyncmon_0 (call=51, rc=1, cib-update=568, confirmed=false) unknown error
> Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_stop_0 (call=52, rc=0, cib-update=572, confirmed=true) ok
> Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_start_0 (call=53, rc=0, cib-update=573, confirmed=true) ok
>
> 3) alpha: FC(mysql)=2, crm_resource -F -r mysql -H alpha
> Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_asyncmon_0 (call=54, rc=1, cib-update=574, confirmed=false) unknown error
> Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_stop_0 (call=55, rc=0, cib-update=576, confirmed=true) ok
> Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM operation mount_stop_0 (call=56, rc=0, cib-update=578, confirmed=true) ok
> beta: FC(mysql)=3
> Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation mount_start_0 (call=36, rc=0, cib-update=92, confirmed=true) ok
> Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation mysql_start_0 (call=37, rc=0, cib-update=93, confirmed=true) ok
> Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation mysql_stop_0 (call=38, rc=0, cib-update=94, confirmed=true) ok
> Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation mount_stop_0 (call=39, rc=0, cib-update=95, confirmed=true) ok
> alpha: FC(mysql)=3
> Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM operation mount_start_0 (call=57, rc=0, cib-update=580, confirmed=true) ok
> Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_start_0 (call=58, rc=0, cib-update=581, confirmed=true) ok
>
> So it seems that - for whatever reason - those constrained resources
> are considered and treated just as if they were in a resource group,
> because they all move to where they can all run, instead of the
> "eat or die" binding of the dependent resource (mysql) to the
> underlying resource (mount) that I had expected from the constraints
> as I set them... shouldn't I?! o_O
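The constraint pair in the configuration above binds mysql to mount both in placement (colocation) and in start order, which is exactly the dependency a group expresses. As a sketch in crm shell syntax (the name grpMysql is made up here, not part of the original configuration):

```
# A group implies both of these constraints:
#   colocation colocMysql inf: mysql mount
#   order orderMysql inf: mount mysql
group grpMysql mount mysql
```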
Yes, those two constraints are equivalent to a group.

> And - concerning the failure-timeout - quite a while later, without
> having reset mysql's failure counter or having done anything else in
> the meantime:
>
> 4) alpha: FC(mysql)=3, crm_resource -F -r mysql -H alpha
> Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_asyncmon_0 (call=59, rc=1, cib-update=592, confirmed=false) unknown error
> Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_stop_0 (call=60, rc=0, cib-update=596, confirmed=true) ok
> Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM operation mount_stop_0 (call=61, rc=0, cib-update=597, confirmed=true) ok
> beta: FC(mysql)=0
> Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation mount_start_0 (call=40, rc=0, cib-update=96, confirmed=true) ok
> Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation mysql_start_0 (call=41, rc=0, cib-update=97, confirmed=true) ok
> Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation mysql_stop_0 (call=42, rc=0, cib-update=98, confirmed=true) ok
> Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation mount_stop_0 (call=43, rc=0, cib-update=99, confirmed=true) ok
> alpha: FC(mysql)=4
> Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM operation mount_start_0 (call=62, rc=0, cib-update=599, confirmed=true) ok
> Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM operation mysql_start_0 (call=63, rc=0, cib-update=600, confirmed=true) ok

This worked as expected, i.e. after the 150s cluster-recheck-interval
the resources were started on alpha.

> > BTW, what's the point of cloneMountMysql? If it can run only
> > where drbd is master, then it can run on one node only:
> >
> > colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
> > order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
> It's a dual-primary DRBD configuration, so when everything is OK (-;
> there are actually two masters of each DRBD multi-state resource...
> even though I admit that the dual primary (or rather master) for
> msDrbdMysql is currently quite redundant, since in the current
> cluster configuration there's only one primitive MySQL resource and
> thus no real need for MySQL's data dir to be mounted on both nodes
> all the time.
> But since it's not harmful to have it mounted on the other node too,
> since msDrbdOpencms and msDrbdShared need to be mounted on both
> nodes, and since I put the complete installation and configuration
> of the cluster into flexibly configurable shell scripts, it's easier
> - i.e. less typing - to just put all DRBD and mount resources'
> configuration into one common loop. (-;

OK. It did cross my mind that it may be a dual-master drbd. Your
configuration is large. If you are going to run that in production and
don't really need a dual-master, then it'd be good to get rid of the
ocfs2 bits to make maintenance easier.

> > > d) I also have the impression that fail-counters don't get reset
> > > after their failure-timeout, because when migration-threshold=3
> > > is set, those issues occur upon every(!) following picking-on,
> > > even when I've waited for nearly 5 minutes (with
> > > failure-timeout=90) without touching the cluster at all.
> > That seems to be a bug though I couldn't reproduce it with a
> > simple configuration.
> I just also tested this once again: it seems that the
> failure-timeout only sets scores back from -inf to around 0
> (wherever they should normally be), allowing the resources to
> return to the node.
> I tested with setting a location constraint for the underlying
> resource (see configuration): after the failure-timeout has expired,
> on the next cluster-recheck (and only then!) the underlying resource
> and its dependents return to the underlying resource's preferred
> location, as you see in the logs above.

The count gets reset, but the cluster acts on it only after the
cluster-recheck-interval, unless something else makes the cluster
calculate new scores.

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
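A closing sketch on the failcount point discussed above: if waiting out the cluster-recheck-interval is not acceptable, the failcount and the resource's failed state can also be cleared by hand, which makes the cluster recompute scores right away. These are the Pacemaker 1.0-era commands; please verify the options against your installed version:

```
# clean up mysql's operation history and failcount on node alpha
crm_resource -C -r mysql -H alpha

# or the same via the crm shell
crm resource cleanup mysql
```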