Sorry, please ignore this mail. Client issues!
-----Original Message-----
From: darren.mans...@opengi.co.uk [mailto:darren.mans...@opengi.co.uk]
Sent: 10 March 2010 13:53
To: deja...@fastmail.fm
Cc: pacemaker@oss.clusterlabs.org
Subject: Re: Re: [Pacemaker] Help with OCFS2 / DLM Stability

On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
> > Hi everyone.
> >
> > Further to some discussions a couple of weeks ago with regard to
> > OCFS2 on SLES 11 HAE I'm looking to finally nail this problem.
> >
> > We have a 3 node cluster that has a STONITH shootout every week.
> > This morning one node got stuck in a state where it couldn't be
> > fenced due to the RSA not being responsive.
> >
> > I'm not sure if the problem is due to:
> >
> > * Network interruption causing Totem failures.
> > * Java (Tomcat) processes falling over.
>
> I suppose that those are activequote and activequoteadmin. You should
> increase the timeouts, 10 seconds is too short in general, and for
> java/tomcat probably even more so.
>
> > * DLM falling over.
> > * Any of the above in any combination.
> >
> > I've attached a hb_report. Could you see if you can see anything?
>
> Any good reason to ignore quorum? For a three node cluster you should
> remove the no-quorum-policy property or, perhaps because of ocfs2, set
> it to freeze.
>
> Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
> SLE11 HAE update available.
>
> From the logs:
>
> Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
>
> Interestingly, there is no lrmd log for this on 03.
>
> Then there are several operation timeouts, perhaps due to ocfs2
> hanging, two activequote and activequoteadmin stop operations could
> not be killed even with -9, so they were probably waiting for the
> disk.
>
> Mar 9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>
> Do you know why the node vanished? You should try to keep your
> networking healthy.
>
> Thanks,
>
> Dejan
>
> > Thanks
> >
> > Darren Mansell

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker