Hi, On 04.08.2011, at 18:21, Steven Dake wrote:
>> Jul 31 03:51:02 node01 corosync[5870]: [TOTEM ] Process pause detected >> for 11149 ms, flushing membership messages. > > This process pause message indicates the scheduler doesn't schedule > corosync for 11 seconds which is greater then the failure detection > timeouts. What does your config file look like? What load are you running? We've had another one of these this morning: "Process pause detected for 11763 ms, flushing membership messages." According to the graphs that are generated from Nagios data, the load of that system jumped from 1.0 to 5.1 ca. 2 minutes before this event, stayed at that value for ~5 minutes then dropped to below 1 afterwards. 10 Minutes later the system got shot, probably because the OCFS2 got confused by the node leaving the cluster. At that time, the machine was only the standby node. The only things that could have been running then, are a daily backup run (TSM) that starts the night before and takes a few hours to complete - and the OCFS2-related processes (the backup of the OCFS2 filesystem is done on that machine). What can I do to investigate this behavior? We've switched to the "deadline" cpu scheduler before the July 31st event. Could this cause this kind of behavior? I was under the impression, that 'deadline' was designed to prevent exactly these kinds of situations. Further increasing the timeout above the current value of 10s doesn't look like it's the solution for this problem. The configuration is unchanged from the one I posted on August 4th. The funny thing is, that the cluster did not show any problems since July 31st. Thanks in advance! -- Sebastian _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker