>> Hi,
>>
>> I have a resource that sometimes can take 10 minutes to start after
>> a failure due to log records that need to be sync'd. (my own OCF)
>>
>> I noticed that while the start action was being performed, if other
>> resources in my cluster report a "not running", no restart will be
>> attempted until my long-running start action returns.
>>
>> Meanwhile, crm_mon reports the resources as "Started" even though
>> they are not running, and may not be for many minutes.
>> Is the lrm process single threaded?

>You are saying that while your RA starts (with a long start timeout),
>and the start action is not yet complete,
>other _independent_ resources are not yet started,
>but crm_mon thinks they are running already,
>even though "something" (what?) reports "not running" for those?

Yes. I am saying that if a resource (R1) is taking a long time to start and
another resource's (R2) monitor action returns "not running", R2 will not be
restarted until the stuck R1 start returns or, in my case, times out. Since
the stop action has not yet been run on R2, crm_mon still says "Started".

>I think you lost me ;)
>please show a "crm configure show"

primitive dummy-1 ocf:heartbeat:Dummy \
        op monitor interval="30s" \
        op start interval="0" timeout="90s" migration-threshold="0"
primitive dummy-main ocf:heartbeat:Dummy \
        op monitor interval="30s" \
        op start interval="0" timeout="30s" \
        meta migration-threshold="0" target-role="Started"
primitive dummy-sleep ocf:heartbeat:DummySleep \
        op monitor interval="60s" \
        op start interval="0" timeout="2m" \
        meta migration-threshold="0" target-role="Started"
colocation d inf: dummy-sleep dummy-main
colocation d1 inf: dummy-1 dummy-main
property $id="cib-bootstrap-options" \
        dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        last-lrm-refresh="1271853339"

>Can you reproduce this easily?

Not easily, but I finally have the correct combination. In my case I have
dependent resources, but I was able to reproduce part of the issue using the
Dummy resource.

>Can you reproduce this with just a few "Dummy" resources?

I added an ocf_log to the monitor action so I could tail the messages file
and see what was happening. I created another resource, "DummySleep", where
I inserted a sleep into dummy_start() as follows:

dummy_start() {
    ocf_log info "OCF_RESKEY_state is ${OCF_RESKEY_state}"
    dummy_monitor
    ret=$?
    ocf_log info "dummy start sleep..."
    # these two lines simulate a hung start; comment them out for normal behaviour
    sleep 3000
    return $OCF_ERR_GENERIC
    ocf_log info "dummy start sleep return..."
    if [ $ret = $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    touch ${OCF_RESKEY_state}
}

I ran the test as follows:

1) Commented out the sleep and return to get the DummySleep resource going
   with the others
2) Put the DummySleep OCF back in place with the sleep turned on
3) Ran "crm resource stop dummy-sleep"
4) Ran "crm resource start dummy-sleep" to cause it to sleep
5) "rm" the state file for dummy-main to cause the failure
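For steps 3) to 5), the shell commands were roughly the following (the state
file path is only an example; it is whatever OCF_RESKEY_state resolves to for
dummy-main on your install):

crm resource stop dummy-sleep
crm resource start dummy-sleep
# the start now blocks in the "sleep 3000" until the 2m start timeout hits
rm /var/run/Dummy-dummy-main.state
# example path for dummy-main's OCF_RESKEY_state file; yours may differ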
You'll notice that the monitor for dummy-main keeps going even though it is
not running, until the dummy-sleep resource is woken up:

Apr 21 10:06:56 qpr1 lrmd: [30826]: info: RA output: (dummy-sleep:start:stderr) 2010/04/21_10:06:56 INFO: dummy start sleep...
Apr 21 10:07:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:01 INFO: dummy monitor
Apr 21 10:07:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:26 INFO: dummy monitor
Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: dummy monitor
Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: Not Running
Apr 21 10:07:31 qpr1 crmd: [30829]: info: process_lrm_event: LRM operation dummy-main_monitor_30000 (call=135, rc=7, cib-update=204, confirmed=false) not running
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_local_callback: Expanded fail-count-dummy-main=value++ to 7
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-dummy-main (7)
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 107: fail-count-dummy-main=7
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-dummy-main (1271858866)
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 109: last-failure-dummy-main=1271858866
Apr 21 10:07:56 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:56 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running
Apr 21 10:08:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:08:26 INFO: dummy monitor
Apr 21 10:08:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:31 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running
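For what it's worth, I was watching this with something like the following
(the messages file path and grep are just how my Heartbeat box is set up):

tail -f /var/log/messages | grep -E 'lrmd|crmd|attrd'
crm_mon -1 -f        # one-shot status with fail counts; dummy-main still shows
                     # "Started" even while fail-count-dummy-main climbs
ps axf | grep 'sleep 3000'   # the hung dummy-sleep start is still in its sleep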
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf