Hi,
It's my first try at asking for help on a mailing list, I hope I won't make any netiquette mistakes. I could really use some help with SBD. Here's my scenario: I have three clusters with a similar configuration: two physical servers with fibre channel shared storage, 4 resources (IP address, ext3 filesystem, Oracle listener, Oracle database) configured in a group, and external/sbd as the stonith device. The operating system is SLES 11 SP1; the cluster components come from the SLES SP1 HA package, in these versions:

openais: 1.1.4-5.6.3
pacemaker: 1.1.5-5.9.11.1
resource-agents: 3.9.3-0.4.26.1
cluster-glue: 1.0.8-0.4.4.1
corosync: 1.3.3-0.3.1
csync2: 1.34-0.2.39

Each of the three clusters will work fine for a couple of days, then both servers of one of the clusters will, at the same time, start the SBD "WARN: Latency: No liveness for" countdown and restart. It happens at different hours and under different server loads (even at night, when the servers are close to 0% load). No two clusters have ever gone down at the same time. Their syslog is superclean: the only warning messages before the reboots are the ones showing the SBD liveness countdown. The SAN department can't see anything wrong on their side; the SAN is used by many other servers, and no one seems to be experiencing similar problems.
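For reference, the stonith side of each cluster looks roughly like this. This is a sketch in crm shell syntax, not a copy of my real CIB: the primitive name "sbd-stonith" and the monitor values are examples I made up for illustration.

```shell
# Sketch of the stonith setup (crm shell syntax). The primitive name
# "sbd-stonith" and the monitor interval/timeout are illustrative,
# not taken from my actual configuration.
crm configure primitive sbd-stonith stonith:external/sbd \
    params sbd_device="/dev/mapper/san_part1" \
    op monitor interval="15s" timeout="60s"
crm configure property stonith-enabled="true"
crm configure property stonith-timeout="60s"
```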
Hardware:
Cluster 1 and Cluster 2: two IBM blades, QLogic QMI2582 (one card, two ports), Brocade blade center FC switch, SAN switch, HP P9500 SAN
Cluster 3: two IBM x3650, QLogic QLE2560 (two cards per server), SAN switch, HP P9500 SAN

Each cluster has a 50GB LUN on the HP P9500 SAN (the SAN is shared, the LUNs are different): partition 1 (7.8 MB) for SBD, partition 2 (49.99 GB) for Oracle on ext3.

What I have done so far:
- introduced "options qla2xxx ql2xmaxqdepth=16 qlport_down_retry=1 ql2xloginretrycount=5 ql2xextended_error_logging=1" in /etc/modprobe.conf.local (then ran mkinitrd and restarted the servers)
- verified with the SAN department that the QLogic firmware of my HBAs is compliant with their requirements
- configured multipath.conf per the HP specifications for the OPEN-V type of SAN
- verified multipathd is working as expected: shutting down one port at a time, the links stay up on the other port; shutting down both, the cluster switches over to the other node
- configured SBD to use the watchdog device (softdog) and the first partition of the LUN; all relevant tests confirm SBD is working as expected (list, dump, message test, message exit, and killing the SBD process reboots the server). Here's my /etc/sysconfig/sbd:

server1:~ # cat /etc/sysconfig/sbd
SBD_DEVICE="/dev/mapper/san_part1"
SBD_OPTS="-W"

- doubled the default values for Timeout (watchdog) and Timeout (msgwait), setting them to 10 and 20, while the stonith timeout is 60s:

server1:~ # sbd -d /dev/mapper/san_part1 dump
==Dumping header on disk /dev/mapper/san_part1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/mapper/san_part1 is dumped

I've even tested with 60 and 120 for Timeout (watchdog) and Timeout (msgwait); when the problem happened again the servers still went all the way through the 60-second countdown to reboot.
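For completeness, the timeouts in the dump above were written with something like the following. This is a sketch: the device path matches my setup, and -1/-4 are the sbd options for the watchdog and msgwait timeouts as I understand them.

```shell
# Re-create the SBD header with doubled timeouts:
#   -1 sets Timeout (watchdog), -4 sets Timeout (msgwait).
# This destroys the existing header, so the cluster should be
# stopped on both nodes first.
sbd -d /dev/mapper/san_part1 -1 10 -4 20 create

# verify the new values
sbd -d /dev/mapper/san_part1 dump
```

As far as I know, msgwait is usually kept at about twice the watchdog timeout, and the cluster's stonith timeout should be comfortably larger than msgwait, which is why I left it at 60s.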
Borrowing the idea from http://www.gossamer-threads.com/lists/linuxha/users/79213 , I'm monitoring access time on the SBD partition on the three clusters: the average time to execute the dump command is 30ms, and it spikes over 100ms a couple of times an hour. There's no slow rise from the average when the problem comes, though. Here's what it looked like the last time (the dump command runs every 2 seconds):

...
real 0m0.031s
real 0m0.031s
real 0m0.030s
real 0m0.030s
real 0m0.030s
real 0m0.030s
real 0m0.031s  <-- last record in the file, no more logging, server will reboot after the watchdog timeout period
...

Right before the last cluster reboot I was monitoring Oracle I/O towards its datafiles, to verify whether Oracle could still access its partition (on the same LUN as the SBD one) when the SBD countdown started, i.e. to tell an SBD-only problem from a general LUN access problem. There was no sign of Oracle I/O trouble during the countdown; it seems Oracle stopped interacting with the I/O monitoring software the very moment the servers rebooted (all servers involved share a common time server, but I can't be 100% sure they were in sync when I checked).

I'm in close contact with the SAN department; the problem might well be the servers losing access to the LUN for some fibre channel issue they still can't see in their SAN logs, but I'd like to be 100% certain the cluster configuration is good. Here are my SBD-related questions:

- Is the 1 MB size for the SBD partition strictly mandatory? The SLES 11 SP1 HA documentation says: "In an environment where all nodes have access to shared storage, a small partition (1MB) is formatted for the use with SBD", while http://linux-ha.org/wiki/SBD_Fencing suggests no size for it. At OS setup the SLES partitioner didn't allow us to create a 1MB partition, it being too small; the smallest size available was 7.8MB. Can this difference in size introduce the random problem we're experiencing?
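The monitoring itself is just a timed loop. Here's a minimal sketch of the kind of script I'm running; the probe() helper name and the log path are illustrative, not my exact script.

```shell
#!/bin/sh
# Minimal latency probe: measure the wall-clock duration of an
# arbitrary command in milliseconds and append it to a log.
# probe() and LOG are illustrative names, not from my real script.
LOG=/tmp/sbd_latency.log

probe() {
    # nanosecond epoch timestamps before/after, converted to ms
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Demo with a harmless command; on the cluster nodes the probed
# command is:  sbd -d /dev/mapper/san_part1 dump
ms=$(probe sleep 0)
echo "$(date '+%F %T') ${ms}ms" >> "$LOG"
```

On the nodes this sits in a "while true; do ...; sleep 2; done" loop against the sbd dump command, which is where the "real 0m0.03xs" records above come from.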
- In http://www.gossamer-threads.com/lists/linuxha/pacemaker/84951 Mr. Lars Marowsky-Bree says: "The new SBD versions will not become stuck on IO anymore". Is the SBD version I'm using one that can become stuck on IO? I've checked, without luck, for SLES HA packages newer than the ones I'm using, but SBD becoming stuck on IO really seems like something that would apply to my case.

Thanks and best regards.
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org