I can't really comment until I know which version of pacemaker this is...

On 1 Nov 2013, at 2:57 am, Jakob Curdes <j...@info-systems.de> wrote:
> Hi, I have a cman-based cluster that uses pcmk-fencing. We have configured
> an ipmilan fencing device and an apc fencing device with stonith.
> I set a fencing order like this:
>
> <fencing-topology> \
>   <fencing-level devices="ipmi_gw2" id="fencing-gw2-2" index="2" target="gw2"/> \
>   <fencing-level devices="apc_power_gw2" id="fencing-gw2-1" index="1" target="gw2"/> \
> </fencing-topology>
>
> This all works as intended, i.e. the apc is used as the first device and shuts
> down the server.
> But when I try to simulate a failure of the APC device by setting an
> alternate IP (that is not reachable), fencing takes a very long time before
> it succeeds.
> The stonith-timeout for the cluster is 20 secs, so I would expect that after
> 20 secs it moves on to the second device. However, what I see is this:
>
> Oct 31 16:09:59 corosync [TOTEM ] A processor failed, forming new configuration.   # machine powered off manually
> Oct 31 16:10:00 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action list for agent fence_apc (target=(null))
> Oct 31 16:10:00 [5875] gw1 stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_apc (list). remaining timeout is 120
> Oct 31 16:10:01 [5875] gw1 stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_apc (list) the maximum number of times (2) allowed
> (...)
> Oct 31 16:10:06 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
> Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.aa3600b2 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: aa3600b2-7b6d-4243-b525-de0a0a7399a8 (duplicate)
> Oct 31 16:10:06 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
> *Oct 31 16:11:30 [5879] gw1 crmd: error: stonith_async_timeout_handler: Async call 4 timed out after 84000ms*
> Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4/151:40:0:304a2845-8177-4018-9fb6-7b94d0d1288a: Timer expired (-62)
> Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4 for gw2 failed (Timer expired): aborting transition.
> Oct 31 16:11:30 [5879] gw1 crmd: info: abort_transition_graph: tengine_stonith_callback:447 - Triggered transition abort (complete=0) : Stonith failed
> Oct 31 16:11:30 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
> Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.429b7850 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: 429b7850-9d23-49f0-abe4-1f18eb8d122a (duplicate)
> Oct 31 16:11:30 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Total remote op timeout set to 240 for fencing of node gw2 for stonith_admin.27011.ad19c5b3
> *Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Requesting that gw1 perform op reboot gw2 with ipmi_gw2 for stonith_admin.27011 (144s)*
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from gw1: Operation now in progress (-115)
> Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action reboot for agent fence_ipmilan (target=gw2)
> *Oct 31 16:12:26 [5875] gw1 stonith-ng: notice: log_operation: Operation 'reboot' [27473] (call 0 from stonith_admin.27011) for host 'gw2' with device 'ipmi_gw2' returned: 0 (OK)*
>
> So it does not take 25 seconds to reboot gw2 with the fallback stonith device
> but more than two minutes, although we even see the 20-second timeout
> values. What am I doing wrong?
>
> Here is the device config:
>
> primitive apc_power_gw2 stonith:fence_apc \
>     params ipaddr="192.168.33.64" pcmk_host_list="gw2" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" port="1" action="reboot"
> primitive ipmi_gw1 stonith:fence_ipmilan \
>     params ipaddr="192.168.33.4" pcmk_host_list="gw1" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
>         power_wait="2" timeout="20" stonith-timeout="15s"
> primitive ipmi_gw2 stonith:fence_ipmilan \
>     params ipaddr="192.168.33.5" pcmk_host_list="gw2" pcmk_host_check="static-list" \
>         pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
>         power_wait="2" timeout="15" stonith-timeout="10s"
>
> property $id="cib-bootstrap-options" \
>     cluster-infrastructure="cman" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="true" \
>     stonith-timeout="20s"
>
> Regards,
> Jakob Curdes
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
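One thing I notice in the config above, independent of the version question: stonith-timeout appears inside params, which means it is handed to the fence agent rather than to stonith-ng, so it may well be ignored. On the 1.1.x series the per-device fencing timeout is normally expressed through the pcmk_reboot_timeout device attribute instead. A sketch of what that would look like for ipmi_gw2 — values illustrative only, and whether the attribute is honoured depends on which pacemaker you are running:

```
primitive ipmi_gw2 stonith:fence_ipmilan \
    params ipaddr="192.168.33.5" pcmk_host_list="gw2" pcmk_host_check="static-list" \
        pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" \
        power_wait="2" timeout="15" pcmk_reboot_timeout="20s"
```

Note that this per-device value bounds a single reboot attempt by that device; the escalation to the next fencing-topology level is still governed by how stonith-ng combines the device timeouts, which is exactly where the version matters.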