On 03.04.2012 14:51, Lars Marowsky-Bree wrote:
> On 2012-04-03T14:06:44, Rainer Krienke <[email protected]> wrote:
> 
>> thanks for the hint to enable the stonith resource. I did and checked
>> that it is set to true now, but after all the behaviour of the cluster
>> is still the same, if I do a halt -f on one node.
>> Access on the clusterfilesystem on the still running node simply hangs.
> 
> It'll pause until SBD has completed the fence.
> 
> This is either caused by a misconfigured sbd setup, or by a too short
> stonith-timeout for your sbd configuration. It'd need hb_report to
> diagnose, or you could try actively reading the logfiles.
> 
> 
> Regards,
>     Lars
> 

Hello

@Lars: thanks for the hint that UNCLEAN is not an allowed state. I
thought "unclean" was simply the natural result of doing halt -f on
this host, just like a filesystem is unclean after a reset of the host.

My SBD device is on an external iSCSI RAID, and after I halted the
other node rzinstal5 manually with halt -f, the SBD disk is still
accessible from the running node rzinstal4 without any problem.
I ran this dump after I had halted the node rzinstal5:

rzinstal4:~ # sbd -d /dev/disk/by-id/scsi-259316a7265713551-part1 dump
==Dumping header on disk /dev/disk/by-id/scsi-259316a7265713551-part1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 90
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 180
==Header on disk /dev/disk/by-id/scsi-259316a7265713551-part1 is dumped
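
If it helps, I can also post the slot contents; I assume they can be
listed the same way as the header is dumped, i.e. with the list
subcommand against the same device:

```shell
# Show the node slots and any pending messages on the SBD device
# (same device as in the dump above; "list" subcommand of sbd)
sbd -d /dev/disk/by-id/scsi-259316a7265713551-part1 list
```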

In /var/log/messages I can see that rzinstal4 tries several times to
fence the dead host rzinstal5, but I cannot see why this is
unsuccessful. Of course the node is already dead, so it cannot respond
to any messages written into the SBD device, but that should not be a
problem.

Perhaps someone with more experience in clustering can spot the problem
in the short log below, or can point me to how I might narrow down the
search.

The log (/var/log/messages) I posted below starts about one minute after
I halted rzinstal5.

....
Apr  3 14:45:56 rzinstal4 crmd: [3910]: info: te_fence_node: Executing
reboot fencing operation (33) on rzinstal5 (timeout=30000)
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info:
can_fence_host_with_device: Refreshing port list for stonith_sbd
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: WARN: parse_host_line:
Could not parse (0 0):
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info:
can_fence_host_with_device: stonith_sbd can fence rzinstal5: dynamic-list
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info: call_remote_stonith:
Requesting that rzinstal4 perform op reboot rzinstal5
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_fence: Exec
<stonith_command t="stonith-ng"
st_async_id="64c8badf-0753-42a6-9ab6-8ff778a3a4e3" st_op="st_fence"
st_callid="0" st_
callopt="0" st_remote_op="64c8badf-0753-42a6-9ab6-8ff778a3a4e3"
st_target="rzinstal5" st_device_action="reboot" st_timeout="27000"
src="rzinstal4" seq="3" />
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info:
can_fence_host_with_device: stonith_sbd can fence rzinstal5: dynamic-list
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_fence: Found
1 matching devices for 'rzinstal5'
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_command:
Processed st_fence from rzinstal4: rc=-1
Apr  3 14:45:57 rzinstal4 stonith-ng: [3904]: info: make_args:
reboot-ing node 'rzinstal5' as 'port=rzinstal5'
Apr  3 14:45:57 rzinstal4 sbd: [5268]: info: rzinstal5 owns slot 1
Apr  3 14:45:57 rzinstal4 sbd: [5268]: info: Writing reset to node slot
rzinstal5
Apr  3 14:46:29 rzinstal4 crmd: [3910]: info: tengine_stonith_callback:
Stonith operation 2/33:0:0:0409029b-02e7-498e-b4d1-650f9f7cad08:
Operation timed out (-8)
Apr  3 14:46:29 rzinstal4 crmd: [3910]: ERROR: tengine_stonith_callback:
Stonith of rzinstal5 failed (-8)... aborting transition.
Apr  3 14:46:29 rzinstal4 crmd: [3910]: info: abort_transition_graph:
tengine_stonith_callback:454 - Triggered transition abort (complete=0) :
Stonith failed
Apr  3 14:46:29 rzinstal4 crmd: [3910]: info: update_abort_priority:
Abort priority upgraded from 0 to 1000000
Apr  3 14:46:29 rzinstal4 crmd: [3910]: info: update_abort_priority:
Abort action done superceeded by restart
Apr  3 14:46:32 rzinstal4 stonith-ng: [3904]: ERROR: remote_op_timeout:
Action reboot (64c8badf-0753-42a6-9ab6-8ff778a3a4e3) for rzinstal5 timed out
Apr  3 14:46:32 rzinstal4 crmd: [3910]: WARN: stonith_perform_callback:
STONITH command failed: Operation timed out
Apr  3 14:46:32 rzinstal4 stonith-ng: [3904]: info: remote_op_done:
Notifing clients of 64c8badf-0753-42a6-9ab6-8ff778a3a4e3 (reboot of
rzinstal5 from 32d17687-11f1-4e83-b776-c86aae03b54b b
y (null)): 1, rc=-8
Apr  3 14:46:32 rzinstal4 crmd: [3910]: ERROR: tengine_stonith_notify:
Peer rzinstal5 could not be terminated (reboot) by <anyone> for
rzinstal4 (ref=64c8badf-0753-42a6-9ab6-8ff778a3a4e3):
Operation timed out
Apr  3 14:46:32 rzinstal4 stonith-ng: [3904]: info:
stonith_notify_client: Sending st_fence-notification to client
3910/0db1b22a-e55a-4375-84f3-2e6d2f98ec85
Apr  3 14:48:57 rzinstal4 sbd: [5268]: info: reset successfully
delivered to rzinstal5
Apr  3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation:
Operation 'reboot' [5262] (call 0 from (null)) for host 'rzinstal5' with
device 'stonith_sbd' returned: 0
Apr  3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation:
stonith_sbd: Performing: stonith -t external/sbd -T reset rzinstal5
Apr  3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation:
stonith_sbd: success: rzinstal5 0
Apr  3 14:48:58 rzinstal4 stonith-ng: [3904]: info:
process_remote_stonith_exec: ExecResult <st-reply
st_origin="stonith_construct_async_reply" t="stonith-ng"
st_op="st_notify" st_remote_op
="64c8badf-0753-42a6-9ab6-8ff778a3a4e3" st_callid="0" st_callopt="0"
st_rc="0" st_output="Performing: stonith -t external/sbd -T reset
rzinstal5 success: rzinstal5 0 " src="rzinstal4" seq="
4" />
Apr  3 14:48:58 rzinstal4 stonith-ng: [3904]: ERROR: remote_op_done:
We've already notified clients of 64c8badf-0753-42a6-9ab6-8ff778a3a4e3
(reboot of rzinstal5 from 32d17687-11f1-4e83-b776
-c86aae03b54b by rzinstal4): 2, rc=0
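
One thing I notice when re-reading the log: the reset was written to
the slot at 14:45:57 and "successfully delivered" at 14:48:57, i.e.
after exactly 180 seconds, which matches the msgwait timeout from the
dump above. The fencing operation, however, was started with
timeout=30000 (30 s), so crmd gave up long before sbd could possibly
report success. A rough sanity check of the two values (both taken
from the dump and the log above, nothing else assumed):

```shell
# Values taken from the sbd header dump and the stonith log above
msgwait=180          # Timeout (msgwait) from the sbd header, in seconds
stonith_timeout=30   # timeout=30000 ms from te_fence_node, in seconds

# sbd cannot confirm a fence before msgwait has elapsed, so the
# stonith timeout must be larger than msgwait for the fence to
# ever be reported as successful in time
if [ "$stonith_timeout" -le "$msgwait" ]; then
    echo "stonith-timeout too short for msgwait"
fi
```

If that reasoning is right, raising the cluster property
stonith-timeout above msgwait (e.g. crm configure property
stonith-timeout=240s) should let the fence complete before crmd
aborts the transition. Does that sound plausible?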

Thanks a lot
Rainer
-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, http://userpages.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html,Fax: +49261287
1001312
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems