Hi,

On Wed, Mar 10, 2010 at 03:34:43PM -0000, darren.mans...@opengi.co.uk wrote:
> On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
>
> Hi,
>
> On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk
> wrote:
> > Hi everyone.
> >
> > Further to some discussions a couple of weeks ago with regard to OCFS2
> > on SLES 11 HAE I'm looking to finally nail this problem.
> >
> > We have a 3 node cluster that has a STONITH shootout every week. This
> > morning one node got stuck in a state where it couldn't be fenced due
> > to the RSA not being responsive.
> >
> > I'm not sure if the problem is due to:
> >
> > * Network interruption causing Totem failures.
> > * Java (Tomcat) processes falling over.
>
> I suppose that those are activequote and activequoteadmin. You
> should increase the timeouts, 10 seconds is too short in general,
> and for java/tomcat probably even more so.
>
> I've increased those. As the monitor operation in the LSB
> script is just a pgrep I don't think it matters that the
> monitor interval is 10s but the timeout is 30s. Is this
> correct?

Yes, much better. Don't forget that there are some quite costly
operations involved with each resource operation regardless of
the nature of the operation itself (in particular forking several
processes).

> > * DLM falling over.
> > * Any of the above in any combination.
> >
> > I've attached a hb_report. Could you see if you can see anything?
>
> Any good reason to ignore quorum? For a three node cluster you
> should remove the no-quorum-policy property or, perhaps because
> of ocfs2, set it to freeze.
>
> Oops. It was a 2 node cluster. The 3rd node was added and
> obviously that property was missed.
>
> Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
> SLE11 HAE update available.
>
> From the logs:
>
> Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN:
> unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on
> OGG-ACTIVEQUOTE-03: unknown exec error
>
> Interestingly, there is no lrmd log for this on 03.
>
> Then there are several operation timeouts, perhaps due to ocfs2
> hanging, two activequote and activequoteadmin stop operations
> could not be killed even with -9, so they were probably waiting
> for the disk.
>
> Mar 9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm ] info:
> pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>
> Do you know why the node vanished? You should try to keep your
> networking healthy.
>
> This is amazingly accurate. It turns out the datacentre had
> some scheduled maintenance we weren't aware of and pulled the
> network cable out causing this. Case solved. Although it
> doesn't explain what happened on previous occasions.

Well, when you provide a hb_report of those, perhaps we could
seek an explanation :)

Thanks,

Dejan

> Thanks,
>
> Dejan
>
> Thanks for your help!
>
> Darren

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
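
For reference, the quorum advice in this thread can be illustrated with a
little arithmetic. The following is only a rough sketch, not Pacemaker code:
it assumes one vote per node and a simple strict-majority rule, and the
names votes_needed() and has_quorum() are hypothetical helpers made up for
the illustration.

```python
# Minimal sketch (not Pacemaker code): majority-quorum arithmetic.
# Assumption: every node has exactly one vote.
# votes_needed() and has_quorum() are hypothetical helper names.

def votes_needed(total_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_up: int) -> bool:
    """True if the surviving partition still holds a majority."""
    return nodes_up >= votes_needed(total_nodes)


if __name__ == "__main__":
    for total in (2, 3):
        survivors = total - 1  # one node lost, e.g. a pulled network cable
        print(total, survivors, has_quorum(total, survivors))
```

Under these assumptions a 2 node cluster that loses one node has no
majority left, which is the usual reason for ignoring quorum there. A
3 node cluster that loses one node still has 2 of 3 votes, so the override
is no longer needed once the third node is added.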