Hi Lars,

>>  Timeout (msgwait)  : 90
>>           stonith-timeout="100s" \
>You may want to increase stonith-timeout a bit further, to increase the
>difference between msgwait (the time that sbd will wait before
>returning) and the cluster allowed time for that. But 10s should not be
>too little, actually ...

Ok, I'll push it out by say +20s more and see when we can get in another test.

>>  The crm resource entry for the stonith device looks like;
>>
>>  primitive st-sbd stonith:external/sbd   params
>>  sbd_device="/dev/disk/by-id/scsi-36001438005de94640000600007470000"
>>  group stonith st-sbd
>Why the group? You can drop that line fine.

Understood. It serves no real purpose and is just a visual item, so that the 
crm_mon output lines up with the other resources displayed.

>>  At this point power off one machine and leave it down for the test.
>>
>>  Expected the cluster to fail over after the stonith-timeout timer value
>>  was reached, 100s.
>No, actually, it should treat the fence action as completed after
>"msgwait" seconds; "stonith-timeout" is the time pacemaker allows for
>that fence operation to complete maximum before considering it failed. I
>think you got these two confused.

Thanks for the clarification.

>>  Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: stonith_fenceExec
>>  <stonith_command t="stonith-ng"
>>  st_async_id="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_op="st_fence"
>>  st_callid="0" st_callopt="0"
>>  st_remote_op="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_target="capep02"
>>  st_device_action="reboot" st_timeout="90000" src="capep01" seq="3" />
>Okay, this is weird. You've set the timeout to 100s, not 90s; that's what
>should be passed in here.

>If msgwait is approximately the same as st_timeout, since st_timeout
>starts counting when the process is kicked off and msgwait after the
>message has been written to disk, it is very probable that it never can
>complete successfully.

That certainly explains why the fence operation never returned.
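If I follow the timing correctly, stonith-timeout has to be comfortably larger 
than msgwait, because the st_timeout clock starts when the fence process is 
kicked off, before the poison pill is even written to disk. A rough sanity 
check of that constraint (the 10s margin is just an illustrative guess on my 
part, not anything from the pacemaker source):

```python
def fence_can_complete(msgwait_s, st_timeout_ms, margin_s=10):
    """Rough check: sbd waits msgwait seconds after the message reaches
    disk, but st_timeout starts counting earlier, so the budget must
    exceed msgwait by some write/startup margin (margin is a guess)."""
    return st_timeout_ms / 1000.0 >= msgwait_s + margin_s

# Our cluster: msgwait=90 but st_timeout arrived as 90000ms -- no headroom.
print(fence_can_complete(90, 90000))   # False
# With the configured 100s+ actually passed through, it would fit.
print(fence_can_complete(90, 120000))  # True
```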

Searching a dump of the cib for 90s and 100s shows nothing related to 90s 
in it.

capep01:~ # egrep  '90s|100s' /tmp/cib.tmp
         <nvpair id="cib-bootstrap-options-stonith-timeout" 
name="stonith-timeout" value="100s"/>


Looking through the output of a crm configure show command - nothing there 
either;

capep01:~ # crm configure show | sed -f /root/joinsed | egrep '90s|100s'
property $id="cib-bootstrap-options"    
dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60"     
cluster-infrastructure="openais"        expected-quorum-votes="2"  
no-quorum-policy="ignore"       dc-deadtime="10s"       stonith-enabled="true"  
stonith-timeout="100s"  last-lrm-refresh="1309554165"


I checked other clusters where I have done, and can still do, some testing, and 
the st_timeout values shown there also do not match the stonith-timeout value.

For example;

osuse01:/var/log # crm configure show | grep stonith-timeout
         stonith-timeout="75s" \

Then grepped out some of the items from the messages when I tried some tests.

Jul  1 17:49:18 osuse01 stonith-ng: [3042]: info: 
stonith_fenceExec<stonith_command t="stonith-ng" 
st_async_id="f774e857-7d64-4398-a433-58fc075d6ab0" st_op="st_fence" 
st_callid="0" st_callopt="0" 
st_remote_op="f774e857-7d64-4398-a433-58fc075d6ab0" st_target="osuse02" 
st_device_action="reboot" st_timeout="54000" src="osuse01" seq="3" />
Jul  1 20:05:02 osuse01 stonith-ng: [3112]: info: 
stonith_fenceExec<stonith_command t="stonith-ng" 
st_async_id="64392ab1-0709-46de-b33e-dfb614c073d9" st_op="st_fence" 
st_callid="0" st_callopt="0" 
st_remote_op="64392ab1-0709-46de-b33e-dfb614c073d9" st_target="osuse02" 
st_device_action="reboot" st_timeout="67500" src="osuse01" seq="3" />

Interesting that st_timeout does not show 75 seconds on any try; it looks 
rather random, as if it were calculated.

Do you know which source program is pulling this value, and how it calculates 
it when the event triggers? I would like to see the algorithm used, in case 
there is a math problem when a non-default value is set.
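For what it's worth, the logged values are not quite random: each one is a 
simple fraction of the configured stonith-timeout (90000 = 0.9 x 100000, 
67500 = 0.9 x 75000, 54000 = 0.72 x 75000). Whether stonith-ng really applies 
such a scaling factor I can't say without reading the source; the snippet 
below is just the arithmetic on the logged numbers, not the actual algorithm:

```python
# Observed st_timeout (ms) vs. the configured stonith-timeout (ms).
# Pure arithmetic on the values from the logs above; the real
# stonith-ng calculation would have to be confirmed in the source.
observed = [
    ("capep01", 90000, 100000),   # stonith-timeout="100s"
    ("osuse01", 67500,  75000),   # stonith-timeout="75s"
    ("osuse01", 54000,  75000),   # second attempt, same config
]
for host, st_timeout, configured in observed:
    print(f"{host}: {st_timeout}/{configured} = {st_timeout / configured:.2f}")
```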


>>  Would you happen to know the possible causes of stonith-ng returning a 1
>>  instead of 0?
>You're hitting a timeout; which you shouldn't, given the bits of the
>configuration you've posted.
>
>Is the above your whole CIB? Are you setting any other timeouts
>somewhere that might override the 100s timeout you set?
>
>Regards,
>     Lars
>-- 
>Architect Storage/HA, OPS Engineering, Novell, Inc.
>SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
>HRB 21284 (AG Nürnberg)
>"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Thanks for the info Lars. Hopefully we'll be able to make some progress 
isolating this.

Take care.
Craig



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
