I can't really comment until I know which version of pacemaker this is...

On 1 Nov 2013, at 2:57 am, Jakob Curdes <j...@info-systems.de> wrote:
> Hi, I have a cman-based cluster that uses pcmk-fencing. We have configured
> an ipmilan fencing device and an apc fencing device with stonith.
> I set a fencing order like this:
>
> <fencing-topology> \
>   <fencing-level devices="ipmi_gw2" id="fencing-gw2-2" index="2" target="gw2"/> \
>   <fencing-level devices="apc_power_gw2" id="fencing-gw2-1" index="1" target="gw2"/> \
> </fencing-topology>
>
> This all works as intended, i.e. the apc is used as the first device and shuts
> down the server.
> But when I try to simulate a failure of the APC device by setting an
> alternate IP (that is not reachable), fencing takes a very long time before
> it succeeds.
> The stonith-timeout for the cluster is 20 secs, so I would expect that after
> 20 secs it moves on to the second device. However, what I see is this:
>
> Oct 31 16:09:59 corosync [TOTEM ] A processor failed, forming new configuration.   # machine powered off manually
> Oct 31 16:10:00 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action list for agent fence_apc (target=(null))
> Oct 31 16:10:00 [5875] gw1 stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_apc (list). remaining timeout is 120
> Oct 31 16:10:01 [5875] gw1 stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_apc (list) the maximum number of times (2) allowed
> (...)
> Oct 31 16:10:06 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
> Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.aa3600b2 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: aa3600b2-7b6d-4243-b525-de0a0a7399a8 (duplicate)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
> *Oct 31 16:11:30 [5879] gw1 crmd: error: stonith_async_timeout_handler: Async call 4 timed out after 84000ms*
> Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4/151:40:0:304a2845-8177-4018-9fb6-7b94d0d1288a: Timer expired (-62)
> Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4 for gw2 failed (Timer expired): aborting transition.
> Oct 31 16:11:30 [5879] gw1 crmd: info: abort_transition_graph: tengine_stonith_callback:447 - Triggered transition abort (complete=0) : Stonith failed
> Oct 31 16:11:30 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
> Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.429b7850 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: 429b7850-9d23-49f0-abe4-1f18eb8d122a (duplicate)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Total remote op timeout set to 240 for fencing of node gw2 for stonith_admin.27011.ad19c5b3
> *Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Requesting that gw1 perform op reboot gw2 with ipmi_gw2 for stonith_admin.27011 (144s)*
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from gw1: Operation now in progress (-115)
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action reboot for agent fence_ipmilan (target=gw2)
> *Oct 31 16:12:26 [5875] gw1 stonith-ng: notice: log_operation: Operation 'reboot' [27473] (call 0 from stonith_admin.27011) for host 'gw2' with device 'ipmi_gw2' returned: 0 (OK)*
>
> So it does not take 25 seconds to reboot gw2 with the fallback stonith device
> but more than two minutes, although we even see the 20-second timeout
> values. What am I doing wrong?
>
> Here is the device config:
>
> primitive apc_power_gw2 stonith:fence_apc \
>     params ipaddr="192.168.33.64" pcmk_host_list="gw2" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" port="1" action="reboot"
> primitive ipmi_gw1 stonith:fence_ipmilan \
>     params ipaddr="192.168.33.4" pcmk_host_list="gw1" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
>         power_wait="2" timeout="20" stonith-timeout="15s"
> primitive ipmi_gw2 stonith:fence_ipmilan \
>     params ipaddr="192.168.33.5" pcmk_host_list="gw2" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
>         power_wait="2" timeout="15" stonith-timeout="10s"
>
> property $id="cib-bootstrap-options" \
>     cluster-infrastructure="cman" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="true" \
>     stonith-timeout="20s"
>
> Regards,
> Jakob Curdes
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
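One thing I notice in the config above, independent of the version question: stonith-timeout appears inside params, which means it is handed to the fence agent rather than to stonith-ng, so it may well be ignored. On the 1.1.x series the per-device fencing timeout is normally expressed through the pcmk_reboot_timeout device attribute instead. A sketch of what that would look like for ipmi_gw2 — values illustrative only, and whether the attribute is honoured depends on which pacemaker you are running:

```
primitive ipmi_gw2 stonith:fence_ipmilan \
    params ipaddr="192.168.33.5" pcmk_host_list="gw2" pcmk_host_check="static-list" \
        pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
        power_wait="2" timeout="15" pcmk_reboot_timeout="20s"
```

Note that this per-device value bounds a single reboot attempt by that device; the escalation to the next fencing-topology level is still governed by how stonith-ng combines the device timeouts, which is exactly where the version matters.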