On 2011-07-05T17:10:16, Craig Lesle <[email protected]> wrote:

Hi Craig,

> Timeout (msgwait)  : 90

>          stonith-timeout="100s" \

You may want to increase stonith-timeout a bit further, to widen the
margin between msgwait (the time sbd waits before returning) and the
time the cluster allows for the fence operation. That said, 10s should
not be too little, actually ...

> The crm resource entry for the stonith device looks like;
> 
> primitive st-sbd stonith:external/sbd   params 
> sbd_device="/dev/disk/by-id/scsi-36001438005de94640000600007470000"
> group stonith st-sbd

Why the group? You can safely drop that line.

> At this point power off one machine and leave it down for the test.
> 
> Expected the cluster to failover after the stonith-timeout timer value 
> was reached, 100s. 

No, actually, it should treat the fence action as completed after
"msgwait" seconds; "stonith-timeout" is the maximum time pacemaker
allows for that fence operation to complete before considering it
failed. I think you got these two confused.

> Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: stonith_fenceExec 
> <stonith_command t="stonith-ng" 
> st_async_id="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_op="st_fence" 
> st_callid="0" st_callopt="0" 
> st_remote_op="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_target="capep02" 
> st_device_action="reboot" st_timeout="90000" src="capep01" seq="3" />

Okay, this is weird. You've set the timeout to 100s, not 90s; that's
what should be passed in here.
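
To make the mismatch explicit, here is a minimal sketch that pulls
st_timeout out of a log fragment like the one quoted above and puts it
next to the configured value. It assumes the attribute is expressed in
milliseconds (which is what 90000 versus 90s suggests); the helper name
is made up for illustration and is not part of any cluster tool.

```python
import re

# Fragment of the stonith-ng log line quoted above.
LOG_FRAGMENT = 'st_target="capep02" st_device_action="reboot" st_timeout="90000"'


def st_timeout_seconds(log_line):
    """Return st_timeout from a log line in seconds, or None if absent.

    Hypothetical helper; assumes the attribute value is in milliseconds.
    """
    match = re.search(r'st_timeout="(\d+)"', log_line)
    if match is None:
        return None
    return int(match.group(1)) / 1000.0


configured_s = 100  # stonith-timeout="100s" from the posted configuration
effective_s = st_timeout_seconds(LOG_FRAGMENT)
print(f"configured={configured_s}s effective={effective_s}s")
```

If the two numbers differ, something other than the posted
stonith-timeout is determining the value handed to the fence operation.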

If msgwait is approximately the same as st_timeout, the operation will
very probably never complete successfully: st_timeout starts counting
when the process is kicked off, whereas msgwait only starts after the
message has been written to disk.
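
A rough way to picture this, as a sketch only: the fence operation needs
the time to write the message plus msgwait to fit inside st_timeout. The
write delay used below is a made-up illustrative number, not a measured
one, and the function name is hypothetical.

```python
def fence_fits(msgwait_s, st_timeout_s, write_delay_s):
    """Return True if the write delay plus msgwait ends before st_timeout.

    st_timeout counts from process start; msgwait counts from the moment
    the message is on disk, so the write delay comes on top of msgwait.
    """
    return write_delay_s + msgwait_s < st_timeout_s


# Assumed write delay of 1s, purely for illustration.
print(fence_fits(msgwait_s=90, st_timeout_s=90, write_delay_s=1))   # False
print(fence_fits(msgwait_s=90, st_timeout_s=100, write_delay_s=1))  # True
```

With msgwait at 90s, an effective timeout of 90s leaves no room for the
write at all, while the configured 100s would leave some headroom.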

> Would you happen to know the possible causes of stonith-ng returning a 1 
> instead of 0?

You're hitting a timeout, which you shouldn't be, given the bits of the
configuration you've posted.

Is the above your whole CIB? Are you setting any other timeouts
somewhere that might override the 100s timeout you set?



Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
