On 03.04.2012 14:51, Lars Marowsky-Bree wrote:
> On 2012-04-03T14:06:44, Rainer Krienke <[email protected]> wrote:
>
>> thanks for the hint to enable the stonith resource. I did and checked
>> that it is set to true now, but after all the behaviour of the cluster
>> is still the same, if I do a halt -f on one node.
>> Access on the clusterfilesystem on the still running node simply hangs.
>
> It'll pause until SBD has completed the fence.
>
> This is either caused by a misconfigured sbd setup, or by a too short
> stonith-timeout for your sbd configuration. It'd need hb_report to
> diagnose, or you could try actively reading the logfiles.
>
>
> Regards,
>     Lars
>
Hello

@Lars: thanks for the hint that UNCLEAN is not a state that is allowed.
I thought that "unclean" was the natural result of doing a halt -f on
this host, just like a filesystem is unclean after a reset of the host.

My SBD device is on an external iSCSI RAID. After I halted the other node
rzinstal5 manually with halt -f, the SBD disk is still accessible from the
running node rzinstal4 without any problem. I ran this dump after I had
halted the node rzinstal5:

rzinstal4:~ # sbd -d /dev/disk/by-id/scsi-259316a7265713551-part1 dump
==Dumping header on disk /dev/disk/by-id/scsi-259316a7265713551-part1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 90
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 180
==Header on disk /dev/disk/by-id/scsi-259316a7265713551-part1 is dumped

In /var/log/messages I can see that rzinstal4 tries several times to fence
the dead host rzinstal5, but I cannot see why it is unsuccessful. Of course
the node is already dead, so it cannot respond to any messages written into
the sbd device, but this should not be a problem.

Perhaps someone with more experience in clustering can spot the problem in
the short log below, or can point me to a way to narrow the search. The log
(/var/log/messages) posted below starts about one minute after I halted
rzinstal5.

....
Apr 3 14:45:56 rzinstal4 crmd: [3910]: info: te_fence_node: Executing reboot fencing operation (33) on rzinstal5 (timeout=30000)
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: can_fence_host_with_device: Refreshing port list for stonith_sbd
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: WARN: parse_host_line: Could not parse (0 0):
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: can_fence_host_with_device: stonith_sbd can fence rzinstal5: dynamic-list
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: call_remote_stonith: Requesting that rzinstal4 perform op reboot rzinstal5
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="64c8badf-0753-42a6-9ab6-8ff778a3a4e3" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="64c8badf-0753-42a6-9ab6-8ff778a3a4e3" st_target="rzinstal5" st_device_action="reboot" st_timeout="27000" src="rzinstal4" seq="3" />
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: can_fence_host_with_device: stonith_sbd can fence rzinstal5: dynamic-list
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_fence: Found 1 matching devices for 'rzinstal5'
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: stonith_command: Processed st_fence from rzinstal4: rc=-1
Apr 3 14:45:57 rzinstal4 stonith-ng: [3904]: info: make_args: reboot-ing node 'rzinstal5' as 'port=rzinstal5'
Apr 3 14:45:57 rzinstal4 sbd: [5268]: info: rzinstal5 owns slot 1
Apr 3 14:45:57 rzinstal4 sbd: [5268]: info: Writing reset to node slot rzinstal5
Apr 3 14:46:29 rzinstal4 crmd: [3910]: info: tengine_stonith_callback: Stonith operation 2/33:0:0:0409029b-02e7-498e-b4d1-650f9f7cad08: Operation timed out (-8)
Apr 3 14:46:29 rzinstal4 crmd: [3910]: ERROR: tengine_stonith_callback: Stonith of rzinstal5 failed (-8)... aborting transition.
Apr 3 14:46:29 rzinstal4 crmd: [3910]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
Apr 3 14:46:29 rzinstal4 crmd: [3910]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Apr 3 14:46:29 rzinstal4 crmd: [3910]: info: update_abort_priority: Abort action done superceeded by restart
Apr 3 14:46:32 rzinstal4 stonith-ng: [3904]: ERROR: remote_op_timeout: Action reboot (64c8badf-0753-42a6-9ab6-8ff778a3a4e3) for rzinstal5 timed out
Apr 3 14:46:32 rzinstal4 crmd: [3910]: WARN: stonith_perform_callback: STONITH command failed: Operation timed out
Apr 3 14:46:32 rzinstal4 stonith-ng: [3904]: info: remote_op_done: Notifing clients of 64c8badf-0753-42a6-9ab6-8ff778a3a4e3 (reboot of rzinstal5 from 32d17687-11f1-4e83-b776-c86aae03b54b by (null)): 1, rc=-8
Apr 3 14:46:32 rzinstal4 crmd: [3910]: ERROR: tengine_stonith_notify: Peer rzinstal5 could not be terminated (reboot) by <anyone> for rzinstal4 (ref=64c8badf-0753-42a6-9ab6-8ff778a3a4e3): Operation timed out
Apr 3 14:46:32 rzinstal4 stonith-ng: [3904]: info: stonith_notify_client: Sending st_fence-notification to client 3910/0db1b22a-e55a-4375-84f3-2e6d2f98ec85
Apr 3 14:48:57 rzinstal4 sbd: [5268]: info: reset successfully delivered to rzinstal5
Apr 3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation: Operation 'reboot' [5262] (call 0 from (null)) for host 'rzinstal5' with device 'stonith_sbd' returned: 0
Apr 3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation: stonith_sbd: Performing: stonith -t external/sbd -T reset rzinstal5
Apr 3 14:48:58 rzinstal4 stonith-ng: [3904]: info: log_operation: stonith_sbd: success: rzinstal5 0
Apr 3 14:48:58 rzinstal4 stonith-ng: [3904]: info: process_remote_stonith_exec: ExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="64c8badf-0753-42a6-9ab6-8ff778a3a4e3" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/sbd -T reset rzinstal5 success: rzinstal5 0 " src="rzinstal4" seq="4" />
Apr 3 14:48:58 rzinstal4 stonith-ng: [3904]: ERROR: remote_op_done: We've already notified clients of 64c8badf-0753-42a6-9ab6-8ff778a3a4e3 (reboot of rzinstal5 from 32d17687-11f1-4e83-b776-c86aae03b54b by rzinstal4): 2, rc=0

Thanks a lot
Rainer
-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, http://userpages.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html,Fax: +49261287 1001312
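
One way to read the log against the sbd dump is to compare the timestamps with the configured timeouts. The following is only a minimal sketch of that arithmetic in Python; the values are copied from this message, while the helper names and the 1.2 safety margin are assumptions made for illustration and do not come from the cluster software itself.

```python
from datetime import datetime

# Values copied from the sbd dump and the log in this message.
MSGWAIT_S = 180           # "Timeout (msgwait) : 180"
FENCE_TIMEOUT_MS = 30000  # "te_fence_node: ... (timeout=30000)"


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Whole seconds between two same-day syslog timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# "Writing reset to node slot" until "reset successfully delivered".
write_to_delivery = seconds_between("14:45:57", "14:48:57")

# "Executing reboot fencing operation" until "Operation timed out (-8)".
request_to_timeout = seconds_between("14:45:56", "14:46:29")


def fence_timeout_covers_msgwait(fence_timeout_ms, msgwait_s, margin=1.2):
    """Hypothetical check: is the fencing timeout at least msgwait plus
    a margin? The default margin of 1.2 is an assumed rule of thumb."""
    return fence_timeout_ms / 1000 >= msgwait_s * margin


if __name__ == "__main__":
    print("sbd write to delivery:", write_to_delivery, "s")
    print("fence request to timeout:", request_to_timeout, "s")
    print("timeout covers msgwait:",
          fence_timeout_covers_msgwait(FENCE_TIMEOUT_MS, MSGWAIT_S))
```

Under these assumptions, sbd reports delivery exactly msgwait (180 s) after writing the reset, while crmd already gives up after roughly 30 s. That would fit the "too short stonith-timeout" case mentioned in the quoted reply, although the log excerpt alone does not prove it.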
