Thank you, that was the solution. Our stonith-timeout is now 160s.
Our SBD timeouts are still:
    Timeout (watchdog) : 60
    Timeout (msgwait)  : 120
Yes, they are deliberately long, to avoid any problems with the multipath driver.
We found similar recommended values in the latest SuSE SLES HA Guide.

Karl
On 2011-05-12T15:16:52, Karl Rößmann <[email protected]> wrote:

This is an Update to my last Mail:

SBD is running on one Node normally:

I didn't mean to inquire wrt the external/sbd fencing agent, but the
system daemon "sbd" - as configured via /etc/sysconfig/sbd and started
(automatically) via /etc/init.d/openais (on SLE HA).
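For reference, a minimal /etc/sysconfig/sbd on SLE HA looks roughly like the
sketch below. The device path is the one used elsewhere in this thread; the
-W option (use the hardware watchdog) is an assumption about this particular
setup, not something confirmed in the thread.

```shell
# /etc/sysconfig/sbd (sketch; exact options depend on the setup)
SBD_DEVICE="/dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1"
SBD_OPTS="-W"
```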

but after powering off Node multix246, it is running on two nodes:


Node multix246: UNCLEAN (offline)
Online: [ multix244 multix245 ]

 Clone Set: dlm_clone [dlm]
     Started: [ multix244 multix245 ]
     Stopped: [ dlm:2 ]
 Clone Set: clvm_clone [clvm]
     Started: [ multix244 multix245 ]
     Stopped: [ clvm:2 ]
 Clone Set: vgsmet_clone [vgsmet]
     Started: [ multix244 multix245 ]
     Stopped: [ vgsmet:2 ]
 smetserv       (ocf::heartbeat:Xen):   Started multix244
SBD_Stonith (stonith:external/sbd) Started [ multix245 multix246 ] <-----

That is normal. It was running on the x246 node previously, but to fence
said node, it needs to be started in the local partition.

Normally, some seconds later, the fence should complete and multix246
should change state to "OFFLINE". The state you see above is only
transient.

If it remains stuck in this state for much longer, I would assume the
fence targeting multix246 isn't actually completing; do you see the
fence/stonith request being issued in the logs? Is there an error
message from sbd on multix245 in the above scenario?

Ah! Got it -

From your other mail, seeing that

sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 list
0       multix244       clear
1       multix245       clear
2       multix246       reset   multix245

suggests that multix246 actually was sent the request; and thus, should
be considered 'fenced' by the remaining cluster.
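(Once multix246 comes back, the sbd daemon on that node normally clears its
own slot at startup. As a sketch, the slot could also be cleared by hand with
something like the following, using the same device path as above:)

```shell
# Sketch: manually clear multix246's slot; normally sbd does this itself
# when the fenced node rejoins the cluster.
sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 \
    message multix246 clear
```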

Looking back in your mails further:

sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 dump
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 60
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 120

You've set extremely long timeouts for the watchdog, and in particular
for the msgwait - this means that a fence will only be considered
completed after 120s by sbd. At the same time, you've set
stonith-timeout to 60s, so if the fence takes longer than that, it'll be
considered failed.

You've set up your cluster so that it can never complete a successful
fence - congratulations! ;-)

If you've got a legitimate reason for setting the msgwait timeout to
120s, you need to set the stonith-timeout to more than 120s (140s, for
example).
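As a rough sketch of the arithmetic: stonith-timeout must exceed msgwait,
plus some headroom for the request itself to be delivered. The 40s of
headroom below is an assumed margin (it happens to give the 160s that was
eventually chosen at the top of this thread):

```shell
# msgwait comes from the sbd dump above; the headroom value is an assumption.
msgwait=120
headroom=40
stonith_timeout=$(( msgwait + headroom ))
echo "stonith-timeout=${stonith_timeout}s"   # prints "stonith-timeout=160s"

# On the cluster, this would then be applied with something like:
#   crm configure property stonith-timeout=160s
```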


Regards,
    Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde





--
Karl Rößmann                            Tel. +49-711-689-1657
Max-Planck-Institut FKF                 Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart                         email [email protected]

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
