On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote: 

        
        Hi,
        
        On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
        > Hi everyone.
        > 
        > Further to some discussions a couple of weeks ago with regard to OCFS2
        > on SLES 11 HAE I'm looking to finally nail this problem.
        > 
        > We have a 3 node cluster that has a STONITH shootout every week. This
        > morning one node got stuck in a state where it couldn't be fenced due
        > to the RSA not being responsive.
        > 
        > I'm not sure if the problem is due to:
        > 
        > *         Network interruption causing Totem failures.
        > *         Java (Tomcat) processes falling over.
        
        I suppose that those are activequote and activequoteadmin. You
        should increase the timeouts, 10 seconds is too short in general,
        and for java/tomcat probably even more so.


I've increased those. As the monitor operation in the LSB script is just a 
pgrep I don't think it matters that the monitor interval is 10s but the timeout 
is 30s. Is this correct? 
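For reference, a rough sketch of where those values live, assuming the 
resource id is activequote and that it wraps an LSB init script of the same 
name (the script name is a guess, not taken from the hb_report): 

```shell
# Assumption: the resource id is "activequote"; adjust to the real id.
crm configure edit activequote
# In the editor, the monitor op would then look roughly like this
# (lsb:activequote is a hypothetical script name):
#   primitive activequote lsb:activequote \
#           op monitor interval="10s" timeout="30s"
```

The same change would apply to activequoteadmin. 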

        
        
        > *         DLM falling over.
        > *         Any of the above in any combination.
        > 
        > I've attached a hb_report. Could you see if you can see anything?
        
        Any good reason to ignore quorum? For a three node cluster you
        should remove the no-quorum-policy property or, perhaps because
        of ocfs2, set it to freeze.


Oops. It was a 2 node cluster. The 3rd node was added and obviously that 
property was missed. 
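
A minimal sketch of the fix, assuming the crm shell that ships with SLE11 HAE 
and that freeze is the value we settle on because of ocfs2: 

```shell
# Stop ignoring quorum now that there are three nodes;
# "freeze" is the value suggested above for ocfs2.
crm configure property no-quorum-policy="freeze"
# Check that the property took effect.
crm configure show | grep no-quorum-policy
```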

        
        
        Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
        SLE11 HAE update available.
        
        From the logs:
        
        Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
        
        Interestingly, there is no lrmd log for this on 03.
        
        Then there are several operation timeouts, perhaps due to ocfs2
        hanging. Two activequote and activequoteadmin stop operations
        could not be killed even with -9, so they were probably waiting
        for the disk.
        
        Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
        
        Do you know why the node vanished? You should try to keep your
        networking healthy.


This is amazingly accurate. It turns out the datacentre had some scheduled 
maintenance we weren't aware of, and someone pulled the network cable out, 
causing this. Case solved, although it doesn't explain what happened on 
previous occasions. 

        
        
        Thanks,
        
        Dejan
        


Thanks for your help!

Darren


_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
