I do not know where this timeout is coming from. It appears to be independent of any cluster settings.
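For what it's worth, subtracting the timestamps in the Nov 28 logs quoted below (query initiated at 15:01:20, remote_op_query_timeout at 15:01:26) suggests the operation is being aborted after roughly 6 seconds, not the configured 60-second stonith-timeout. A throwaway sketch of that arithmetic:

```python
from datetime import datetime

# Timestamps copied from the Nov 28 stonith-ng log quoted below:
# the reboot query is initiated at 15:01:20 and remote_op_query_timeout
# fires at 15:01:26 (assumption: both events fall on the same day).
fmt = "%b %d %H:%M:%S"
query_sent = datetime.strptime("Nov 28 15:01:20", fmt)
query_timed_out = datetime.strptime("Nov 28 15:01:26", fmt)

elapsed = (query_timed_out - query_sent).total_seconds()
print(elapsed)  # 6.0 -- well short of stonith-timeout="60s" (timeout=60000 ms)
```

That 6-second gap points at an internal stonith query timeout rather than the cluster-level stonith-timeout property, which would match the "can not fence sdgxen-3: dynamic-list" messages.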
I've tested causing a node to be marked UNCLEAN by messing with a
VirtualDomain RA.

Expected results: STONITH
Observed results: Timeout. No STONITH.

sdgxen-2 logs:
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: pe_fence_node: Node sdgxen-3 will be fenced to recover from resource failure(s)
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: common_apply_stickiness: Forcing clone-VD away from sdgxen-2 after 1000000 failures (max=1000000)
Nov 30 15:26:06 sdgxen-2 sbd: [1127]: info: Latency: 1
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: common_apply_stickiness: Forcing clone-VD away from sdgxen-3 after 1000000 failures (max=1000000)
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: ERROR: native_create_actions: Resource clone-VD (ocf::VirtualDomain) is active on 2 nodes attempting recovery
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: stage6: Scheduling Node sdgxen-2 for STONITH
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: native_stop_constraints: Stop of failed resource clone-VD is implicit after sdgxen-2 is fenced
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: stage6: Scheduling Node sdgxen-3 for STONITH
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: WARN: native_stop_constraints: Stop of failed resource clone-VD is implicit after sdgxen-3 is fenced
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: notice: LogActions: Stop clone-VD (sdgxen-2)
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: notice: LogActions: Stop clone-VD (sdgxen-3)
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: notice: LogActions: Stop stonith-sbd (sdgxen-2)
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: ERROR: process_pe_message: Transition 38: ERRORs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-error-144.bz2
Nov 30 15:26:06 sdgxen-2 pengine: [1142]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: unpack_graph: Unpacked transition 38: 8 actions in 8 synapses
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1322666766-106) derived from /var/lib/pengine/pe-error-144.bz2
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: te_pseudo_action: Pseudo action 4 fired and confirmed
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
Nov 30 15:26:06 sdgxen-2 stonith-ng: [1138]: info: initiate_remote_stonith_op: Initiating remote operation reboot for sdgxen-3: 18cffdca-d3b1-4867-ab91-b3d29e5a06ba
Nov 30 15:26:06 sdgxen-2 crmd: [1143]: info: te_fence_node: Executing reboot fencing operation (8) on sdgxen-3 (timeout=60000)
Nov 30 15:26:06 sdgxen-2 stonith-ng: [1138]: info: can_fence_host_with_device: stonith-sbd can not fence sdgxen-3: dynamic-list
Nov 30 15:26:06 sdgxen-2 stonith-ng: [1138]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
Nov 30 15:26:07 sdgxen-2 sbd: [1127]: info: Latency: 1

sdgxen-3 logs:
Nov 30 15:26:05 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:05 sdgxen-3 stonith-ng: [1136]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
Nov 30 15:26:06 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:07 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:08 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:09 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:10 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:11 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:11 sdgxen-3 stonith-ng: [1136]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
Nov 30 15:26:12 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:13 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:14 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:15 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:16 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:17 sdgxen-3 sbd: [1125]: info: Latency: 1
Nov 30 15:26:17 sdgxen-3 stonith-ng: [1136]: info: stonith_command: Processed st_query from sdgxen-2: rc=0

No nodes were successfully shot.

With all due respect, this does not appear to be a configuration issue and appears to be a bug in Pacemaker. I'd like to open a new bug regarding this issue.

Thanks,
-Hal

On Mon, Nov 28, 2011 at 6:57 PM, Andreas Kurz <[email protected]> wrote:
> On 11/29/2011 12:14 AM, Hal Martin wrote:
>> Sorry; they were included in the previous email but it appears it was
>> not properly spaced to be noticeable in the wall of text.
>
> Indeed ... already there, sorry for the noise.
>
> strange ... where does this timeout come from? I don't see an evidence
> this fencing request ran for 60sec ...
>
> Did you try to provoke a fencing action without using crm shell?
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Syslog from sdgxen-3:
>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_ais_dispatch: Update relayed from sdgxen-2
>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_trigger_update: Sending flush op to all hosts for: terminate (true)
>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_perform_update: Sent update 7: terminate=true
>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node sdgxen-2 now has id: 2065306796
>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node 2065306796 is now known as sdgxen-2
>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
>> Nov 28 15:01:21 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:22 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:23 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:24 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:25 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:26 sdgxen-3 sbd: [442]: info: Latency: 1
>> Nov 28 15:01:14 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
>>
>> Thanks,
>> -Hal
>>
>> On Mon, Nov 28, 2011 at 6:10 PM, Andreas Kurz <[email protected]> wrote:
>>> On 11/28/2011 08:07 PM, Hal Martin wrote:
>>>> Thank you for the updated link.
>>>>
>>>> I have recompiled pacemaker from checkout b9889764 and stonith still
>>>> fails to shoot nodes.
>>>
>>> Maybe posting also the logs from sdgxen-3 can help.
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> sdgxen-2:/ # crm node fence sdgxen-3
>>>> Do you really want to shoot sdgxen-3?
y
>>>>
>>>> Syslog from sdgxen-2:
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: determine_online_status: Node sdgxen-3 is unclean
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node sdgxen-3 for STONITH
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: LogActions: Leave stonith-sbd (Started sdgxen-2)
>>>> Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>>> Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: unpack_graph: Unpacked transition 4: 4 actions in 4 synapses
>>>> Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1322492480-29) derived from /var/lib/pengine/pe-warn-1278.bz2
>>>> Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
>>>> Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_fence_node: Executing reboot fencing operation (8) on sdgxen-3 (timeout=60000)
>>>> Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: initiate_remote_stonith_op: Initiating remote operation reboot for sdgxen-3: 76727be7-eecb-4778-857c-1a9288c63ee6
>>>> Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: can_fence_host_with_device: stonith-sbd can not fence sdgxen-3: dynamic-list
>>>> Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: process_pe_message: Transition 4: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-1278.bz2
>>>> Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
>>>> Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_query_timeout: Query 76727be7-eecb-4778-857c-1a9288c63ee6 for sdgxen-3 timed out
>>>> Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_timeout: Action reboot (76727be7-eecb-4778-857c-1a9288c63ee6) for sdgxen-3 timed out
>>>> Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: remote_op_done: Notifing clients of 76727be7-eecb-4778-857c-1a9288c63ee6 (reboot of sdgxen-3 from ee8c34db-0e5d-4227-aa46-0ad8b3f306d1 by (null)): 0, rc=-8
>>>> Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: stonith_notify_client: Sending st_fence-notification to client 457/67849bf4-1881-48b9-a5e8-ab1f72116a81
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: StonithOp <remote-op state="0" st_target="sdgxen-3" st_op="reboot" />
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: Stonith operation 2/8:4:0:bd203590-3295-4f31-a720-01760a5394e8: Operation timed out (-8)
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_callback: Stonith of sdgxen-3 failed (-8)... aborting transition.
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort action done superceeded by restart
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_notify: Peer sdgxen-3 could not be terminated (reboot) by <anyone> for sdgxen-2 (ref=76727be7-eecb-4778-857c-1a9288c63ee6): Operation timed out
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: run_graph: ====================================================
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: notice: run_graph: Transition 4 (Complete=2, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pengine/pe-warn-1278.bz2): Stopped
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: te_graph_trigger: Transition 4 is now complete
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke: Query 81: Requesting the current CIB: S_POLICY_ENGINE
>>>> Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke_callback: Invoking the PE: query=81, ref=pe_calc-dc-1322492486-30, seq=240, quorate=1
>>>> Nov 28 15:01:26 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested
>>>> Nov 28 15:01:26 sdgxen-2 pengine: [456]: WARN: determine_online_status: Node sdgxen-3 is unclean
>>>> Nov 28 15:01:26 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node sdgxen-3 for STONITH
>>>> Nov 28 15:01:26 sdgxen-2 pengine: [456]: notice: LogActions: Leave stonith-sbd (Started sdgxen-2)
>>>>
>>>> Syslog from sdgxen-3:
>>>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_ais_dispatch: Update relayed from sdgxen-2
>>>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_trigger_update: Sending flush op to all hosts for: terminate (true)
>>>> Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_perform_update: Sent update 7: terminate=true
>>>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node sdgxen-2 now has id: 2065306796
>>>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node 2065306796 is now known as sdgxen-2
>>>> Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
>>>> Nov 28 15:01:21 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:22 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:23 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:24 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:25 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:26 sdgxen-3 sbd: [442]: info: Latency: 1
>>>> Nov 28 15:01:14 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0
>>>>
>>>> sdgxen-2:/ # crm_verify -L -V
>>>> crm_verify[572]: 2011/11/28_15:06:01 WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested
>>>> crm_verify[572]: 2011/11/28_15:06:01 WARN: determine_online_status: Node sdgxen-3 is unclean
>>>> crm_verify[572]: 2011/11/28_15:06:01 WARN: stage6: Scheduling Node sdgxen-3 for STONITH
>>>> Warnings found during check: config may not be valid
>>>>
>>>> sdgxen-2:/ # crm configure show
>>>> node sdgxen-2
>>>> node sdgxen-3
>>>> primitive stonith-sbd stonith:external/sbd \
>>>>         meta is-managed="true" target-role="Started"
>>>> property $id="cib-bootstrap-options" \
>>>>         dc-version="1.1.6-git" \
>>>>         cluster-infrastructure="openais" \
>>>>         stonith-enabled="true" \
>>>>         stonith-timeout="60s" \
>>>>         stonith-action="reboot" \
>>>>         expected-quorum-votes="2"
>>>>
>>>> I appreciate any feedback on this issue.
>>>>
>>>> Thanks,
>>>> Hal
>>>>
>>>> On Mon, Nov 28, 2011 at 10:44 AM, Florian Haas <[email protected]> wrote:
>>>>> On Mon, Nov 28, 2011 at 4:35 PM, Hal Martin <[email protected]> wrote:
>>>>>> Looking at the mercurial repository for pacemaker
>>>>>> (http://hg.clusterlabs.org/pacemaker/) I do not see any check-ins
>>>>>> since 1.1.6 was tagged two months ago.
>>>>>
>>>>> Pacemaker has since moved to GitHub:
>>>>>
>>>>> https://github.com/ClusterLabs/pacemaker
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Cheers,
>>>>> Florian
>>>>>
>>>>> --
>>>>> Need help with Pacemaker?
>>>>> http://www.hastexo.com/knowledge/pacemaker
>>>>> _______________________________________________
>>>>> Linux-HA mailing list
>>>>> [email protected]
>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>> See also: http://linux-ha.org/ReportingProblems
>>>>>
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
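To Andreas's earlier question about provoking a fence without the crm shell, a sketch of two ways to do that under the configuration above. This is an illustration, not a tested procedure: the sbd device path is an assumption (substitute your own), and DRY_RUN defaults to 1 so the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: provoke a fencing action without going through the crm shell.
# DRY_RUN=1 (the default) prints commands instead of executing them.
NODE=${NODE:-sdgxen-3}
SBD_DEV=${SBD_DEV:-/dev/sdb1}   # hypothetical sbd device -- use yours
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Ask stonith-ng directly, bypassing the crm shell entirely:
run stonith_admin --reboot "$NODE"

# Or write a reset message straight into the node's sbd slot:
run sbd -d "$SBD_DEV" message "$NODE" reset
```

If stonith_admin also times out after ~6 seconds, that would support the theory that the internal query timeout, not stonith-timeout, is what's firing; if `sbd ... message` reboots the node, the device and watchdog path are fine and the problem sits between stonith-ng and the sbd agent.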
