Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence of a situation where I add a resource and start it and it seems to start but then right away fail. i.e.
# clean up resource before trying to start, just to make sure we start with a clean slate # crm resource cleanup testfs-resource1 Cleaning up testfs-resource1 on node1 Waiting for 2 replies from the CRMd.. OK # now try to start it # crm_resource -r testfs-resource1 -p target-role -m -v Started # monitor teh start up for success # crm resource status testfs-resource1: resource testfs-resource1 is NOT running # crm resource status testfs-resource1 resource testfs-resource1 is NOT running # crm resource status testfs-resource1 resource testfs-resource1 is NOT running ... # crm resource status testfs-resource1 resource testfs-resource1 is NOT running # crm resource status testfs-resource1 resource testfs-resource1 is NOT running # crm resource status testfs-resource1 resource testfs-resource1 is running on: node1 # it started. check once more: # crm status ============ Last updated: Tue May 7 02:37:34 2013 Last change: Tue May 7 02:36:17 2013 via crm_resource on node1 Stack: openais Current DC: node1 - partition WITHOUT quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 1 Nodes configured, 2 expected votes 3 Resources configured. ============ Online: [ node1 ] st-fencing (stonith:fence_foo): Started node1 resource2 (ocf::foo:Target): Started node1 testfs-resource1 (ocf::foo:Target): Started node1 FAILED Failed actions: testfs-resource1_monitor_0 (node=node1, call=-1, rc=1, status=Timed Out): unknown error # but lo and behold, it failed, with a monitor operation failing. # stop it # crm_resource -r testfs-resource1 -p target-role -m -v Stopped: 0 The syslog for this whole operation, starting with adding the resource is as follows: May 7 02:36:12 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="15" num_updates="4" /> May 7 02:36:12 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.16.1) : Non-status change May 7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <cib epoch="16" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="crm_resource" cib-last-written="Tue May 7 02:35:56 2013" have-quorum="0" dc-uuid="node1" > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <configuration > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <resources > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <primitive class="ocf" provider="foo" type="Target" id="testfs-resource1" __crm_diff_marker__="added:top" > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <meta_attributes id="testfs-resource1-meta_attributes" > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <nvpair name="target-role" id="testfs-resource1-meta_attributes-target-role" value="Stopped" /> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </meta_attributes> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <operations id="testfs-resource1-operations" > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <op id="testfs-resource1-monitor-5" interval="5" name="monitor" timeout="60" /> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <op id="testfs-resource1-start-0" interval="0" name="start" timeout="300" /> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <op id="testfs-resource1-stop-0" interval="0" name="stop" timeout="300" /> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </operations> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <instance_attributes id="testfs-resource1-instance_attributes" > May 7 02:36:12 node1 cib[16831]: info: cib:diff: + <nvpair id="testfs-resource1-instance_attributes-target" name="target" value="364cfbf8-26dc-44c9-98ad-f8f9d0fafd9a" /> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </instance_attributes> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </primitive> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </resources> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </configuration> May 7 02:36:12 node1 cib[16831]: info: cib:diff: + </cib> May 7 02:36:12 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_create for section resources (origin=local/cibadmin/2, version=0.16.1): ok (rc=0) May 7 02:36:12 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore May 7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] May 7 02:36:12 node1 crmd[16836]: info: do_te_invoke: Processing graph 13 (ref=pe_calc-dc-1367894172-72) derived from /var/lib/pengine/pe-input-124.bz2 May 7 02:36:12 node1 crmd[16836]: info: te_rsc_command: Initiating action 5: monitor testfs-resource1_monitor_0 on node1 (local) May 7 02:36:12 node1 lrmd: [16833]: info: rsc:testfs-resource1:8: probe May 7 02:36:12 node1 pengine[16835]: notice: process_pe_message: Transition 13: PEngine Input stored in: /var/lib/pengine/pe-input-124.bz2 May 7 02:36:14 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.17.1) : Non-status change May 7 02:36:14 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="16" num_updates="1" /> May 7 02:36:14 node1 cib[16831]: info: cib:diff: + <cib epoch="17" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="cibadmin" cib-last-written="Tue May 7 02:36:12 2013" have-quorum="0" dc-uuid="node1" > May 7 02:36:14 node1 cib[16831]: info: cib:diff: + <configuration > May 7 02:36:14 node1 cib[16831]: info: cib:diff: + <constraints > May 7 02:36:14 node1 cib[16831]: info: cib:diff: + <rsc_location id="testfs-resource1-primary" node="node1" rsc="testfs-resource1" score="20" __crm_diff_marker__="added:top" /> May 7 02:36:14 node1 cib[16831]: info: cib:diff: + </constraints> May 7 02:36:14 node1 cib[16831]: info: cib:diff: + </configuration> May 7 02:36:14 node1 cib[16831]: info: cib:diff: + </cib> May 7 02:36:14 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_create for section constraints (origin=local/cibadmin/2, version=0.17.1): ok (rc=0) May 7 02:36:16 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='testfs-resource1'] (origin=local/crmd/249, version=0.17.1): ok (rc=0) May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1 May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed May 7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1 May 7 02:36:16 node1 crmd[16836]: info: notify_deleted: Notifying 18002_crm_resource on node1 that testfs-resource1 was deleted May 7 02:36:16 node1 crmd[16836]: warning: decode_transition_key: Bad UUID (crm-resource-18002) in sscanf result (3) for 0:0:crm-resource-18002 May 7 02:36:16 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='testfs-resource1'] (origin=local/crmd/250, version=0.17.2): ok (rc=0) May 7 02:36:16 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.18.1) : Non-status change May 7 02:36:16 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="17" num_updates="2" > May 7 02:36:16 node1 cib[16831]: info: cib:diff: - <configuration > May 7 02:36:16 node1 cib[16831]: info: cib:diff: - <crm_config > May 7 02:36:16 node1 cib[16831]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" > May 7 02:36:16 node1 cib[16831]: info: cib:diff: - <nvpair value="1367894155" id="cib-bootstrap-options-last-lrm-refresh" /> May 7 02:36:16 node1 cib[16831]: info: cib:diff: - </cluster_property_set> May 7 02:36:16 node1 cib[16831]: info: cib:diff: - </crm_config> May 7 02:36:16 node1 cib[16831]: info: cib:diff: - </configuration> May 7 02:36:16 node1 cib[16831]: info: cib:diff: - </cib> May 7 02:36:16 node1 cib[16831]: info: cib:diff: + <cib epoch="18" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="cibadmin" cib-last-written="Tue May 7 02:36:14 2013" have-quorum="0" dc-uuid="node1" > May 7 02:36:16 node1 cib[16831]: info: cib:diff: + <configuration > May 7 02:36:16 node1 cib[16831]: info: cib:diff: + <crm_config > May 7 02:36:16 node1 cib[16831]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" > May 7 02:36:16 node1 cib[16831]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1367894176" /> May 7 02:36:16 node1 cib[16831]: info: cib:diff: + </cluster_property_set> May 7 02:36:16 node1 cib[16831]: info: cib:diff: + </crm_config> May 7 02:36:16 node1 cib[16831]: info: cib:diff: + </configuration> May 7 02:36:16 node1 cib[16831]: info: cib:diff: + </cib> May 7 02:36:16 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/252, version=0.18.1): ok (rc=0) May 7 02:36:16 node1 crmd[16836]: info: ais_dispatch_message: Membership 4: quorum still lost May 7 02:36:16 node1 crmd[16836]: info: crmd_ais_dispatch: Setting expected votes to 2 May 7 02:36:16 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/255, version=0.18.2): ok (rc=0) May 7 02:36:16 node1 crmd[16836]: warning: do_update_resource: Resource testfs-resource1 no longer exists in the lrmd May 7 02:36:16 node1 crmd[16836]: info: process_lrm_event: LRM operation testfs-resource1_monitor_0 (call=8, rc=7, cib-update=0, confirmed=false) not running May 7 02:36:16 node1 crmd[16836]: info: update_history_cache: Resource testfs-resource1 no longer exists, not updating cache May 7 02:36:17 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.19.1) : Non-status change May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="18" num_updates="2" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <configuration > May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <resources > May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <primitive id="testfs-resource1" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <meta_attributes id="testfs-resource1-meta_attributes" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: - <nvpair value="Stopped" id="testfs-resource1-meta_attributes-target-role" /> May 7 02:36:17 node1 cib[16831]: info: cib:diff: - </meta_attributes> May 7 02:36:17 node1 cib[16831]: info: cib:diff: - </primitive> May 7 02:36:17 node1 cib[16831]: info: cib:diff: - </resources> May 7 02:36:17 node1 cib[16831]: info: cib:diff: - </configuration> May 7 02:36:17 node1 cib[16831]: info: cib:diff: - </cib> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <cib epoch="19" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="crmd" cib-last-written="Tue May 7 02:36:16 2013" have-quorum="0" dc-uuid="node1" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <configuration > May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <resources > May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <primitive class="ocf" provider="foo" type="Target" id="testfs-resource1" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <meta_attributes id="testfs-resource1-meta_attributes" > May 7 02:36:17 node1 cib[16831]: info: cib:diff: + <nvpair name="target-role" id="testfs-resource1-meta_attributes-target-role" value="Started" /> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + </meta_attributes> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + </primitive> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + </resources> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + </configuration> May 7 02:36:17 node1 cib[16831]: info: cib:diff: + </cib> May 7 02:36:17 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_modify for section resources (origin=local/crm_resource/4, version=0.19.1): ok (rc=0) May 7 02:37:32 node1 crmd[16836]: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false) May 7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0) May 7 02:37:32 node1 crmd[16836]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost May 7 02:37:32 node1 crmd[16836]: warning: cib_action_update: rsc_op 5: testfs-resource1_monitor_0 on node1 timed out May 7 02:37:32 node1 crmd[16836]: info: create_operation_update: cib_action_update: Updating resouce testfs-resource1 after Timed Out monitor op (interval=0) May 7 02:37:32 node1 crmd[16836]: info: te_rsc_command: Initiating action 4: probe_complete probe_complete on node1 (local) - no waiting May 7 02:37:32 node1 crmd[16836]: notice: run_graph: ==== Transition 13 (Complete=2, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-124.bz2): Stopped May 7 02:37:32 node1 crmd[16836]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ] May 7 02:37:32 node1 crmd[16836]: info: process_graph_event: Action testfs-resource1_monitor_0 arrived after a completed transition May 7 02:37:32 node1 crmd[16836]: info: abort_transition_graph: process_graph_event:481 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=testfs-resource1_last_failure_0, magic=2:1;5:13:7:5d728e81-4496-4eac-9f3f-ac6d21163ad7, cib=0.19.2) : Inactive graph May 7 02:37:32 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore May 7 02:37:32 node1 pengine[16835]: warning: unpack_rsc_op: Processing failed op testfs-resource1_last_failure_0 on node1: unknown error (1) May 7 02:37:32 node1 pengine[16835]: notice: LogActions: Recover testfs-resource1#011(Started node1) May 7 02:37:32 node1 crmd[16836]: info: handle_response: pe_calc calculation pe_calc-dc-1367894252-77 is obsolete May 7 02:37:32 node1 pengine[16835]: notice: process_pe_message: Transition 14: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2 May 7 02:37:32 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore May 7 02:37:32 node1 pengine[16835]: warning: unpack_rsc_op: Processing failed op testfs-resource1_last_failure_0 on node1: unknown error (1) May 7 02:37:32 node1 pengine[16835]: notice: LogActions: Recover testfs-resource1#011(Started node1) May 7 02:37:32 node1 crmd[16836]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] May 7 02:37:32 node1 crmd[16836]: info: do_te_invoke: Processing graph 15 (ref=pe_calc-dc-1367894252-78) derived from /var/lib/pengine/pe-input-125.bz2 May 7 02:37:32 node1 crmd[16836]: info: te_rsc_command: Initiating action 2: stop testfs-resource1_stop_0 on node1 (local) May 7 02:37:32 node1 lrmd: [16833]: info: rsc:testfs-resource1:9: stop May 7 02:37:32 node1 pengine[16835]: notice: process_pe_message: Transition 15: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2 May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="19" num_updates="2" > May 7 02:37:34 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.20.1) : Non-status change May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <configuration > May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <resources > May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <primitive id="testfs-resource1" > May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <meta_attributes id="testfs-resource1-meta_attributes" > May 7 02:37:34 node1 cib[16831]: info: cib:diff: - <nvpair value="Started" id="testfs-resource1-meta_attributes-target-role" /> May 7 02:37:34 node1 cib[16831]: info: cib:diff: - </meta_attributes> May 7 02:37:34 node1 cib[16831]: info: cib:diff: - </primitive> May 7 02:37:34 node1 cib[16831]: info: cib:diff: - </resources> May 7 02:37:34 node1 cib[16831]: info: cib:diff: - </configuration> May 7 02:37:34 node1 cib[16831]: info: cib:diff: - </cib> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <cib epoch="20" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="crm_resource" cib-last-written="Tue May 7 02:36:17 2013" have-quorum="0" dc-uuid="node1" > May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <configuration > May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <resources > May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <primitive class="ocf" provider="foo" type="Target" id="testfs-resource1" > May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <meta_attributes id="testfs-resource1-meta_attributes" > May 7 02:37:34 node1 cib[16831]: info: cib:diff: + <nvpair name="target-role" id="testfs-resource1-meta_attributes-target-role" value="Stopped" /> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + </meta_attributes> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + </primitive> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + </resources> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + </configuration> May 7 02:37:34 node1 cib[16831]: info: cib:diff: + </cib> May 7 02:37:34 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_modify for section resources (origin=local/crm_resource/4, version=0.20.1): ok (rc=0) May 7 02:37:36 node1 crmd[16836]: info: process_lrm_event: LRM operation testfs-resource1_stop_0 (call=9, rc=0, cib-update=259, confirmed=true) ok May 7 02:37:36 node1 crmd[16836]: notice: run_graph: ==== Transition 15 (Complete=1, Pending=0, Fired=0, Skipped=3, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Stopped May 7 02:37:36 node1 crmd[16836]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ] May 7 02:37:36 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore May 7 02:37:36 node1 pengine[16835]: warning: unpack_rsc_op: Processing failed op testfs-resource1_last_failure_0 on node1: unknown error (1) May 7 02:37:36 node1 crmd[16836]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] May 7 02:37:36 node1 crmd[16836]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1367894256-80) derived from /var/lib/pengine/pe-input-126.bz2 May 7 02:37:36 node1 crmd[16836]: notice: run_graph: ==== Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-126.bz2): Complete May 7 02:37:36 node1 crmd[16836]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] May 7 02:37:36 node1 pengine[16835]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-126.bz2 I do see the: May 7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0) in the log. Is that the root cause of the problem? If so, what's that trying to tell me, exactly? If not, what is the cause of the problem? It really can't be the RA timing out since I give the monitor operation a 60 second timeout and the status action of the RA only take a few seconds at most to run and is not really an operation that can get blocked on anything. It's effectively the grepping of a file. Cheers, b.
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org