On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
> > Hi everyone.
> >
> > Further to some discussions a couple of weeks ago with regard to OCFS2
> > on SLES 11 HAE I'm looking to finally nail this problem.
> >
> > We have a 3 node cluster that has a STONITH shootout every week. This
> > morning one node got stuck in a state where it couldn't be fenced due
> > to the RSA not being responsive.
> >
> > I'm not sure if the problem is due to:
> >
> > * Network interruption causing Totem failures.
> > * Java (Tomcat) processes falling over.
>
> I suppose that those are activequote and activequoteadmin. You
> should increase the timeouts, 10 seconds is too short in general,
> and for java/tomcat probably even more so.

I've increased those. As the monitor operation in the LSB script is just
a pgrep I don't think it matters that the monitor interval is 10s but
the timeout is 30s. Is this correct?

> > * DLM falling over.
> > * Any of the above in any combination.
> >
> > I've attached a hb_report. Could you see if you can see anything?
>
> Any good reason to ignore quorum? For a three node cluster you
> should remove the no-quorum-policy property or, perhaps because
> of ocfs2, set it to freeze.

Oops. It was a 2 node cluster. The 3rd node was added and obviously that
property was missed.

> Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
> SLE11 HAE update available.
>
> From the logs:
>
> Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
>
> Interestingly, there is no lrmd log for this on 03.
>
> Then there are several operation timeouts, perhaps due to ocfs2
> hanging, two activequote and activequoteadmin stop operations
> could not be killed even with -9, so they were probably waiting
> for the disk.
>
> Mar 9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>
> Do you know why the node vanished? You should try to keep your
> networking healthy.

This is amazingly accurate. It turns out the datacentre had some
scheduled maintenance we weren't aware of and pulled the network cable
out, causing this. Case solved. Although it doesn't explain what happened
on previous occasions.

> Thanks,
>
> Dejan

Thanks for your help!

Darren
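
For anyone reading this in the archive, the quorum point above can be illustrated with a small sketch. It is not Pacemaker or OpenAIS code; the function names are made up for illustration, and it assumes one vote per node and a simple majority rule. With two nodes a single survivor can never hold a majority, which is presumably why quorum was being ignored originally; with three nodes a surviving pair does hold one, so ignoring quorum is no longer needed.

```python
# Illustrative only: simple-majority quorum with one vote per node.
# votes_needed() and has_quorum() are hypothetical helpers, not part of
# any cluster stack API.

def votes_needed(total_nodes: int) -> int:
    """Smallest strict majority among total_nodes one-vote members."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable partition holds a strict majority."""
    return reachable_nodes >= votes_needed(total_nodes)


if __name__ == "__main__":
    for total in (2, 3):
        for reachable in range(1, total + 1):
            state = "quorate" if has_quorum(total, reachable) else "no quorum"
            print(f"{total}-node cluster, {reachable} reachable: {state}")
```

In the three node case, the single node cut off by the network outage would be the partition without quorum, and that is the partition the no-quorum-policy setting applies to.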
_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker