Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

Sebastian Kaps Thu, 11 Aug 2011 03:11:49 -0700

Hi,

On 04.08.2011, at 18:21, Steven Dake wrote:


>> Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
>> for 11149 ms, flushing membership messages.
> 
> This process pause message indicates the scheduler doesn't schedule
> corosync for 11 seconds, which is greater than the failure detection
> timeouts.  What does your config file look like?  What load are you running?


We've had another one of these this morning:
"Process pause detected for 11763 ms, flushing membership messages."
According to the graphs generated from Nagios data, the load on that system
jumped from 1.0 to 5.1 about 2 minutes before this event, stayed at that
value for ~5 minutes, then dropped below 1. Ten minutes later the system got
shot, probably because OCFS2 got confused by the node leaving the cluster.
At that time, the machine was only the standby node. The only things that
could have been running then are a daily backup run (TSM), which starts the
night before and takes a few hours to complete, and the OCFS2-related
processes (the backup of the OCFS2 filesystem is done on that machine).
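
To line such events up with the load graphs, the pause messages can be pulled
out of the syslog. The following is only a rough sketch: the function name
find_pauses and the 10000 ms default are assumptions made for illustration
(the default mirrors the 10s timeout mentioned in this thread), and the
pattern only covers the message format quoted above.

```python
import re

# Matches the message format quoted in this thread, e.g.
# "Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected for 11149 ms, ..."
PAUSE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+corosync\[\d+\]:"
    r"\s+\[TOTEM\s*\]\s+Process pause detected for (?P<ms>\d+) ms"
)


def find_pauses(lines, timeout_ms=10000):
    """Return (timestamp, host, pause_ms, exceeds_timeout) per pause message.

    timeout_ms is a hypothetical parameter; set it to the failure
    detection timeout actually configured on the cluster.
    """
    events = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pause_ms = int(match.group("ms"))
            events.append(
                (match.group("stamp"), match.group("host"), pause_ms, pause_ms > timeout_ms)
            )
    return events


if __name__ == "__main__":
    sample = [
        "Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause "
        "detected for 11149 ms, flushing membership messages.",
    ]
    for event in find_pauses(sample):
        print(event)
```

The timestamps returned this way can then be compared by hand against the
Nagios load graphs and the backup schedule.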

What can I do to investigate this behavior? We switched to the "deadline"
I/O scheduler before the July 31st event. Could this cause this kind of
behavior? I was under the impression that 'deadline' was designed to prevent
exactly these kinds of situations.
Further increasing the timeout above the current value of 10s doesn't look
like the solution to this problem.
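
One thing that can be verified quickly is which scheduler is really active on
each disk. The sketch below assumes that 'deadline' here refers to the Linux
block I/O scheduler, which is exposed per device under
/sys/block/<dev>/queue/scheduler with the active entry in square brackets.
The helper names are made up for illustration.

```python
import glob
import re


def active_scheduler(text):
    """Return the bracketed (active) entry, e.g. 'deadline' from 'noop [deadline] cfq'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def schedulers_by_device(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active scheduler (None if unreadable)."""
    result = {}
    for path in glob.glob(pattern):
        device = path.split("/")[3]  # /sys/block/<device>/queue/scheduler
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in sorted(schedulers_by_device().items()):
        print(device, scheduler)
```

This only confirms the setting on each device; it says nothing about whether
the scheduler is related to the pauses.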

The configuration is unchanged from the one I posted on August 4th.
The funny thing is that the cluster had not shown any problems since July 31st.

Thanks in advance!

-- 
Sebastian


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
