Thanks Lars,

> SP1? That's no longer supported, and the overlapping support period to
> SP2 long since expired. You really want to update to SP2+maintenance
> updates.

Unfortunately the OS and SP versions for the Oracle project these
clusters belong to have been decided several layers over my head; I'll
make a point of pushing for the upgrade to SP2 anyway, I might get
lucky.

By now I've taken an unskilled look at sbd.c and put a -v in the
/etc/sysconfig/sbd file of a non-production cluster, and I'm enjoying
the latency details in the syslog. While the SAN department
investigates their side of the problem, I'll take a look at trying a
different stonith resource: all the servers involved have some kind of
IBM management console.
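Something along these lines is what I have in mind, if I've read the
external/ipmi agent's parameters correctly (node names, addresses and
credentials below are only placeholders I still have to fill in from
our consoles):

  # one stonith resource per node, talking to that node's IBM
  # management console over IPMI, kept off the node it is meant to fence
  primitive stonith-node1 stonith:external/ipmi \
          params hostname=node1 ipaddr=192.168.1.101 \
                 userid=USERID passwd=PASSW0RD interface=lanplus \
          op monitor interval=3600s timeout=60s
  location l-stonith-node1 stonith-node1 -inf: node1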
Thanks for your answers to my questions and for your time, very much
appreciated.

andrea


------------------------------

Message: 4
Date: Thu, 2 May 2013 23:36:42 +0200
From: Lars Marowsky-Bree <l...@suse.com>
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] Frequent SBD triggered server reboots
Message-ID: <20130502213642.gc3...@suse.de>
Content-Type: text/plain; charset=iso-8859-1

On 2013-05-02T16:11:11, andrea cuozzo <andrea.cuo...@sysma.it> wrote:

> external/sbd as stonith device. Operating system is SLES 11 SP1,
> cluster components come from the SLES SP1 HA package and are these
> versions:

SP1? That's no longer supported, and the overlapping support period to
SP2 long since expired. You really want to update to SP2+maintenance
updates.

> Each one of the three clusters will work fine for a couple of days,
> then both servers of one of the clusters at the same time will start
> the SBD "WARN: Latency: No liveness for" countdown and restart. It
> happens at different hours, and under different server loads (even at
> night, when the servers are close to 0% load). No two clusters have
> ever gone down at the same time. Their syslog is super clean; the only
> warning messages before the reboots are the ones showing the SBD
> liveness countdown. The SAN department can't see anything wrong on
> their side, the SAN is used by many other servers, and no-one seems to
> be experiencing similar problems.

That's really strange. Newer SBD versions cope much better with IO that
gets stuck in the multipath layer forever - they'll timeout, abort and
most of the time recover. You really want to upgrade.

In case just your one SBD partition goes bad, you can also have three
of them, which obviously improves resilience (if they are on different
disks/channels, or connected via iSCSI/FCoE etc).
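For example, something along these lines (the device paths are just
placeholders for three partitions on different disks/paths, and this
assumes an sbd new enough to handle multiple devices, i.e. after the
upgrade):

  # initialize all three partitions for SBD (placeholder paths)
  sbd -d /dev/disk/by-id/diskA-part1 \
      -d /dev/disk/by-id/diskB-part1 \
      -d /dev/disk/by-id/diskC-part1 create

  # /etc/sysconfig/sbd: list all three, separated by ";"
  SBD_DEVICE="/dev/disk/by-id/diskA-part1;/dev/disk/by-id/diskB-part1;/dev/disk/by-id/diskC-part1"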
> - is the 1 MB size for the SBD partition strictly mandatory? In the
> SLES 11 SP1 HA documentation it's written: "In an environment where
> all nodes have access to shared storage, a small partition (1MB) is
> formatted for use with SBD".

No, this is just the minimum size that SBD needs. You can make it
larger if you want to.

> http://www.gossamer-threads.com/lists/linuxha/pacemaker/84951 Mr. Lars
> Marowsky-Bree says: "The new SBD versions will not become stuck on IO
> anymore". Is the SBD version I'm using one that can become stuck on
> IO? I've checked without luck for SLES HA packages newer than the one
> I'm using, but the SBD being stuck on IO really seems something that
> would apply to my case.

Yes. You really want to update, see the first paragraph. There are no
newer SBD versions for SP1. (If you have LTSS, the story may be
different, but in that case, kindly contact our support directly.)


Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes."
 -- Oscar Wilde