Re: [Openais] Corosync enters endless loop after hiccup in system

Steven Dake Wed, 07 Jul 2010 16:06:40 -0700

On 03/30/2010 01:36 PM, Colin wrote:
> On Tue, Mar 30, 2010 at 6:48 PM, Steven Dake<[email protected]>  wrote:
>> On Tue, 2010-03-30 at 11:43 +0200, Colin wrote:
>>>
>>> we are running Corosync 1.2.0-0ubuntu1 on Ubuntu 10.4 beta w/current
>>> updates; the cluster consists of two systems running in KVM, each on a
>>> dedicated host.
>>>
>>> We have observed several times, but are unfortunately unable to nail
>>> the exact cause, that when the virtualised system that is running
>>> corosync has a "hiccup", i.e. hangs for a couple of seconds when we
>>> introduce a delay into its storage access, then the corosync process
>>> enters an endless loop from which it doesn't ever seem to recover.
>>>
>>> In this endless loop the process uses 193% CPU in the 2-CPU
>>> virtualised system, and is issuing a stream of wait4() system-calls
>>> (with an occasional nanosleep() and some futex-stuff).
>>
>> If you could explain how you delay your vm for a short period, I could
>> debug.
>>
>> One thing you can try is to increase your token timeout (the token field
>> in the totem{} directive).  At the moment it is set to 1000 msec (1
>> second), but that may not be suitable for some virtualized environments.
>
> Typical Murphy's law, now that I'm trying to get a stack-trace I'm
> having problems reproducing the problem (after it having occurred
> several times previously, perhaps it's because I installed the debug
> build?) ... anyhow, I'm not worried about the cluster losing
> connection when the VM hiccups; it's the fact that it doesn't recover
> afterwards because the corosync-process is in an endless loop that is
> disconcerting.
>
> Regards, Colin


I believe this is fixed by a recent update in flatiron (revision 2985)
that has not yet been released as a z release.
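
For reference, the token timeout mentioned earlier in the thread is set
in the totem{} section of corosync.conf. A minimal sketch follows; the
5000 msec value is only an assumed illustration, not a tested
recommendation, and the rest of the totem{} section is left as it is:

```
totem {
        version: 2
        # Token timeout in milliseconds (1000 at the moment, per the
        # earlier message). 5000 is a hypothetical value for a VM that
        # may stall for a few seconds.
        token: 5000
}
```

Raising the timeout only makes the cluster more tolerant of short
stalls; it does not address the endless loop itself.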

Regards
-steve
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
