Re: [Pacemaker] Help with OCFS2 / DLM Stability

Dejan Muhamedagic Wed, 10 Mar 2010 08:06:17 -0800

Hi,

On Wed, Mar 10, 2010 at 03:34:43PM -0000, darren.mans...@opengi.co.uk wrote:
> On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote: 
> 
>       
>       Hi,
>       
>       On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
>       > Hi everyone.
>       > 
>       >  
>       > 
>       > Further to some discussions a couple of weeks ago with regard to OCFS2
>       > on SLES 11 HAE I'm looking to finally nail this problem.
>       > 
>       > We have a 3 node cluster that has a STONITH shootout every week. This
>       > morning one node got stuck in a state where it couldn't be fenced due
>       > to the RSA not being responsive.
>       > 
>       > I'm not sure if the problem is due to:
>       > 
>       > *         Network interruption causing Totem failures.
>       > *         Java (Tomcat) processes falling over.
>       
>       I suppose that those are activequote and activequoteadmin. You
>       should increase the timeouts, 10 seconds is too short in general,
>       and for java/tomcat probably even more so.
> 
> 
> I've increased those. As the monitor operation in the LSB
> script is just a pgrep I don't think it matters that the
> monitor interval is 10s but the timeout is 30s. Is this
> correct?


Yes, much better. Don't forget that there are some quite costly
operations involved with each resource operation regardless of
the nature of the operation itself (in particular forking several
processes).
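
For illustration, the monitor settings discussed above could be checked and adjusted from the crm shell roughly as follows. This is only a sketch: the resource name activequote comes from the logs, while the lsb:activequote agent name in the comment is an assumption about the configuration, not something taken from the hb_report.

```shell
# Show the current definition of the resource (name taken from the logs).
crm configure show activequote

# Open the definition in an editor to change the op line.
crm configure edit activequote

# Shape of the monitor op after the change. The agent name
# "lsb:activequote" is hypothetical; the values are the ones
# mentioned in this thread.
#   primitive activequote lsb:activequote \
#       op monitor interval="10s" timeout="30s"
```

The timeout has to cover the overhead of running the operation (forking the init script and so on), not only the pgrep itself, which is why it can sensibly be longer than the interval.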

>       > *         DLM falling over.
>       > *         Any of the above in any combination.
>       > 
>       > I've attached a hb_report. Could you see if you can see anything?
>       
>       Any good reason to ignore quorum? For a three node cluster you
>       should remove the no-quorum-policy property or, perhaps because
>       of ocfs2, set it to freeze.
> 
> 
> Oops. It was a 2 node cluster. The 3rd node was added and
> obviously that property was missed. 
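
A minimal sketch of the change suggested above, assuming the crm shell is used to manage the configuration. The value freeze is the one proposed for ocfs2; removing the property altogether would instead fall back to the default policy.

```shell
# Stop ignoring quorum now that the cluster has three nodes.
crm configure property no-quorum-policy="freeze"

# Confirm that the property was stored.
crm configure show | grep no-quorum-policy
```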
> 
>       
>       
>       Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
>       SLE11 HAE update available.
>       
>       From the logs:
>       
>       Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
>       
>       Interestingly, there is no lrmd log for this on 03.
>       
>       Then there are several operation timeouts, perhaps due to ocfs2
>       hanging, two activequote and activequoteadmin stop operations
>       could not be killed even with -9, so they were probably waiting
>       for the disk.
>       
>       Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>       
>       Do you know why the node vanished? You should try to keep your
>       networking healthy.
> 
> 
> This is amazingly accurate. It turns out the datacentre had
> some scheduled maintenance we weren't aware of and pulled the
> network cable out causing this. Case solved. Although it
> doesn't explain what happened on previous occasions. 

Well, when you provide a hb_report of those, perhaps we could
seek an explanation :)
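
In the meantime, one way to find out when nodes dropped out of the membership on earlier occasions is to search the logs for the same message as quoted above. The following is a rough sketch: the helper name lost_peers is made up for this example, and the sample file stands in for the real syslog, whose location depends on the setup.

```shell
# Print every membership-loss event reported by the Pacemaker
# openais plugin in the given log file.
lost_peers() {
    grep 'pcmk_peer_update: lost:' "$1"
}

# Sample input built from the two log lines quoted in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
EOF

lost_peers "$log"
```

The timestamps of the matching lines give the time window to pass to hb_report for each of the previous incidents.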

Thanks,

Dejan

>       
>       
>       Thanks,
>       
>       Dejan
>       
> 
> 
> Thanks for your help!
> 
> Darren
> 
> 



_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
