Hi,

On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
> Hi everyone.
>
> Further to some discussions a couple of weeks ago with regard to OCFS2
> on SLES 11 HAE I'm looking to finally nail this problem.
>
> We have a 3 node cluster that has a STONITH shootout every week. This
> morning one node got stuck in a state where it couldn't be fenced due
> the RSA not being responsive.
>
> I'm not sure if the problem is due to:
>
> * Network interruption causing Totem failures.
> * Java (Tomcat) processes falling over.

I suppose that those are activequote and activequoteadmin. You
should increase the timeouts: 10 seconds is too short in general,
and for java/tomcat probably even more so.

> * DLM falling over.
> * Any of the above in any combination.
>
> I've attached a hb_report. Could you see if you can see anything?

Any good reason to ignore quorum? For a three node cluster you
should remove the no-quorum-policy property or, perhaps because
of ocfs2, set it to freeze.

Pacemaker is 1.0.3, so perhaps it's time to upgrade too. There is
a SLE11 HAE update available.

From the logs:

Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error

Interestingly, there is no lrmd log for this on 03. Then there
are several operation timeouts, perhaps due to ocfs2 hanging.
Two activequote and activequoteadmin stop operations could not
be killed even with -9, so they were probably waiting for the
disk.

Mar 9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642

Do you know why the node vanished? You should try to keep your
networking healthy.

Thanks,

Dejan

> Thanks
>
> Darren Mansell
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
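
A note on reading the pengine warning quoted above: the operation key
activequote:1_monitor_10000 has the form <resource>_<operation>_<interval
in ms>, so it refers to the monitor operation that runs every 10 seconds on
clone instance activequote:1. The timeout is configured separately and does
not appear in that line. Below is a minimal Python sketch for pulling these
fields out of such lines when going through an hb_report. The function name
parse_failed_op and the regular expression are illustrative assumptions, not
part of any Pacemaker tool, and the pattern is only written against the one
log line shown in this thread.

```python
import re

# Matches the pengine "Processing failed op" warning quoted in this thread.
# Assumption: the operation key is <resource>_<operation>_<interval in ms>.
FAILED_OP = re.compile(
    r"unpack_rsc_op: Processing failed op "
    r"(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval_ms>\d+) "
    r"on (?P<node>\S+): (?P<reason>.+)$"
)


def parse_failed_op(line):
    """Return a dict describing the failed operation, or None if no match."""
    match = FAILED_OP.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["interval_ms"] = int(info["interval_ms"])
    return info


# The warning exactly as quoted in the message above.
line = (
    "Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: "
    "unpack_rsc_op: Processing failed op activequote:1_monitor_10000 "
    "on OGG-ACTIVEQUOTE-03: unknown exec error"
)

if __name__ == "__main__":
    print(parse_failed_op(line))
```

Running the same function over every line of the collected logs gives a
quick list of which operations failed on which node, which can then be
compared against the lrmd log on that node.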