Greetings.

Let me set the scene: a two-node cluster running SUSE 11 SP1, 
updated/patched as of 7/1/2011.

The sbd device and timeout values used are as follows.

capep01:~ # sbd -d 
/dev/disk/by-id/scsi-36001438005de94640000600007470000 dump
==Dumping header on disk 
/dev/disk/by-id/scsi-36001438005de94640000600007470000
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 90
==Header on disk /dev/disk/by-id/scsi-36001438005de94640000600007470000 
is dumped
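
For completeness, and if memory serves, a header with those values would 
have been created with something along these lines (-1 sets the watchdog 
timeout and -4 the msgwait timeout; allocate and loop are the defaults, 
and the exact flags may vary between sbd versions):

capep01:~ # sbd -d /dev/disk/by-id/scsi-36001438005de94640000600007470000 \
            -1 45 -4 90 create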

The crm cluster property values are set as follows:

property $id="cib-bootstrap-options" \
         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
         cluster-infrastructure="openais" \
         expected-quorum-votes="2" \
         no-quorum-policy="ignore" \
         dc-deadtime="10s" \
         stonith-enabled="true" \
         stonith-timeout="100s" \
         last-lrm-refresh="1309554165"

rsc_defaults $id="rsc_defaults-options" \
         resource-stickiness="1000"
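
(Side note, in case it is relevant: the rule of thumb I have seen is that 
stonith-timeout should exceed msgwait by a comfortable margin, and with 
msgwait at 90s and stonith-timeout at 100s we are fairly close to the 
edge. Bumping it is easy enough, e.g.

capep01:~ # crm configure property stonith-timeout="120s"

though, given the behaviour described below, that alone may not explain it.)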


The crm resource entry for the stonith device looks like this:

primitive st-sbd stonith:external/sbd \
         params sbd_device="/dev/disk/by-id/scsi-36001438005de94640000600007470000"
group stonith st-sbd
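
Outside of pacemaker, the basic sanity check I use for the device is to 
confirm the header and node slots are readable, e.g.

capep01:~ # sbd -d /dev/disk/by-id/scsi-36001438005de94640000600007470000 list

(a slot for each of capep01 and capep02 should show up there).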

The sbd device is served from an HP EVA over Fibre Channel (qla2xxx 
driver) and is managed by multipathd, as are all the data disks in use.
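
To rule out a path problem on the surviving node, I intend to capture the 
multipath state during the next test window, e.g. (WWID shown for 
illustration):

capep01:~ # multipath -ll 36001438005de94640000600007470000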


At this point we powered off one machine and left it down for the test.

We expected the cluster to fail over once the stonith-timeout value 
(100s) was reached. Instead we appear to have hit an error, and the 
cluster did not fail over in the 15 minutes we gave it. We then brought 
the downed box back up, and once its cluster software started, the 
failover (fencing) completed.
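
(For anyone trying to reproduce this, the cluster state during the wait 
can be watched with a one-shot crm_mon, e.g.

capep01:~ # crm_mon -r1
)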

I did not get enough time in the environment to perform multiple tests, 
but this test was prompted by two similar incidents in the last two weeks 
on this and another cluster, where a cluster member failed due to memory 
problems and the cluster remained stuck waiting for fencing to complete.


Jul  1 20:36:26 capep01 crmd: [18250]: WARN: check_dead_member: Our DC 
node (capep02) left the cluster
Jul  1 20:36:26 capep01 corosync[18240]:   [CPG   ] chosen downlist from 
node r(0) ip(10.1.0.17)
Jul  1 20:36:26 capep01 corosync[18240]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul  1 20:36:26 capep01 crmd: [18250]: WARN: match_down_event: No match 
for shutdown action on capep02
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: pe_fence_node: Node 
capep02 will be fenced because it is un-expectedly down
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: determine_online_status: 
Node capep02 is unclean
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: custom_action: Action 
st-sbd_stop_0 on capep02 is unrunnable (offline)
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: custom_action: Marking 
node capep02 unclean
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: stage6: Scheduling Node 
capep02 for STONITH
Jul  1 20:36:26 capep01 pengine: [18249]: WARN: process_pe_message: 
Transition 0: WARNINGs found during PE processing. PEngine Input stored 
in: (null)
Jul  1 20:36:27 capep01 stonith-ng: [18245]: WARN: parse_host_line: 
Could not parse (0 0):
Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: 
can_fence_host_with_device: st-sbd can fence capep02: dynamic-list
Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: stonith_query: Found 
1 matching devices for 'capep02'
Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: call_remote_stonith: 
Requesting that capep01 perform op reboot capep02
Jul  1 20:36:27 capep01 stonith-ng: [18245]: info: stonith_fenceExec 
<stonith_command t="stonith-ng" 
st_async_id="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_op="st_fence" 
st_callid="0" st_callopt="0" 
st_remote_op="cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f" st_target="capep02" 
st_device_action="reboot" st_timeout="90000" src="capep01" seq="3" />

90 seconds after the initial event (msgwait, I presume), we see:

Jul  1 20:37:57 capep01 stonith-ng: [18245]: WARN: 
cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f process (PID 26414) timed out (try 
1).  Killing with signal SIGTERM (15).
Jul  1 20:37:57 capep01 stonith-ng: [18245]: ERROR: log_operation: 
Operation 'reboot' [26414] for host 'capep02' with device 'st-sbd' 
returned: 1 (call 0 from (null))
Jul  1 20:38:06 capep01 stonith-ng: [18245]: ERROR: remote_op_timeout: 
Action reboot (cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f) for capep02 timed out
Jul  1 20:38:06 capep01 crmd: [18250]: ERROR: tengine_stonith_callback: 
Stonith of capep02 failed (-7)... aborting transition.
Jul  1 20:38:06 capep01 crmd: [18250]: ERROR: tengine_stonith_notify: 
Peer capep02 could not be terminated (reboot) by <anyone> for capep01 
(ref=cc6d89dc-f34c-4077-bd5c-d6a48dcc6c9f): Operation timed out
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: pe_fence_node: Node 
capep02 will be fenced because it is un-expectedly down
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: determine_online_status: 
Node capep02 is unclean
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: custom_action: Action 
ip_SEPci_stop_0 on capep02 is unrunnable (offline)
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: custom_action: Marking 
node capep02 unclean
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: custom_action: Action 
SEPjci_stop_0 on capep02 is unrunnable (offline)
Jul  1 20:38:06 capep01 pengine: [18249]: WARN: custom_action: Marking 
node capep02 unclean
Jul  1 20:38:06 capep01 pengine: [18249]: ERROR: native_create_actions: 
Resource st-sbd (stonith::external/sbd) is active on 2 nodes attempting 
recovery


From here it rolls on and on with the failed stonith messages, and the 
node stays in the unclean, waiting-to-be-fenced state until the test was 
stopped.

I have tried this same setup on other clusters with iSCSI serving the sbd 
disk -- not multipathed -- and there the line ending in "with device 
'st-sbd' returned:" shows 0 instead of 1, and failover worked properly.

Would you happen to know the possible causes of stonith-ng returning a 1 
instead of 0?
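
If it helps narrow things down, next time I have the environment I plan 
to exercise the messaging path by hand and capture the exit status 
directly, along the lines of ('test' is harmless; 'reset' would actually 
fence the peer):

capep01:~ # time sbd -d /dev/disk/by-id/scsi-36001438005de94640000600007470000 \
            message capep02 test ; echo "exit: $?"

on the theory that if the write through the multipath device blocks or 
stalls, it should show up there as well.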

Thanks in advance for any information.

Craig


