Re: [Openais] Corosync enters endless loop after hiccup in system

Steven Dake Wed, 26 May 2010 10:23:07 -0700

On 05/26/2010 02:42 AM, Colin wrote:
> On Tue, Mar 30, 2010 at 1:00 PM, Dejan Muhamedagic<[email protected]>  wrote:
>> On Tue, Mar 30, 2010 at 11:43:22AM +0200, Colin wrote:
>>> we are running Corosync 1.2.0-0ubuntu1 on Ubuntu 10.04 beta w/current
>>> updates; the cluster consists of two systems running in KVM, each on a
>>> dedicated host.
>>>
>>> We have observed several times, but are unfortunately unable to nail
>>> the exact cause, that when the virtualised system that is running
>>> corosync has a "hiccup", i.e. hangs for a couple of seconds when we
>>> introduce a delay into its storage access, then the corosync process
>>> enters an endless loop from which it doesn't ever seem to recover.
>>>
>>> In this endless loop the process uses 193% CPU in the 2-CPU
>>> virtualised system, and is issuing a stream of wait4() system-calls
>>> (with an occasional nanosleep() and some futex-stuff).
>>
>> It'd be good to kill -ABRT the process and then get the
>> backtrace with gdb. If you're running pacemaker, there's
>> hb_report to collect all relevant information (incl the
>> backtraces). Make sure that coredumps are allowed and install the
>> packages which contain the debugging information.
>
> A few weeks later ... we didn't manage to reliably reproduce that
> particular problem, but we've either found it again, or one of its
> brothers: The corosync process goes into an endless loop, consumes
> exactly one CPU at 100%.
>
> We are still gathering evidence, but it's definitely something in
> logsys.c, and probably something with the flight recorder.
>
> Quite frankly, the code in logsys.c does not exactly build much
> confidence, to the contrary, there is a lot of complicated fiddling
> that is not properly encapsulated in a single place; the code strongly
> violates the DRY and KISS principles, and looks a bit broken, for
> example
> - records_reclaim() does not check whether the flight recorder is
> larger than "words", and neither does its only caller


If you mean it doesn't check whether the 2nd parameter (words) is larger
than the total flight recorder size, potentially causing a crash, then
yes, this is a weakness in error checking.
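
As a rough illustration of the kind of check that is missing, here is a
minimal sketch. It is not taken from logsys.c: FLT_DATA_SIZE_WORDS and
flt_record_fits() are hypothetical names, and the rule that a record must
be strictly smaller than the recorder (so that head == tail can only mean
"empty") is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for the flight recorder size, counted in words. */
#define FLT_DATA_SIZE_WORDS 1024u

/*
 * Returns 1 if a record of `words` words could ever fit in the recorder,
 * 0 otherwise. A caller would run this before trying to reclaim space,
 * so the reclaim loop never has to free more than the buffer holds.
 *
 * Assumption: a record must be strictly smaller than the buffer, which
 * also rules out the "record exactly fills the recorder" edge case.
 */
static int flt_record_fits(size_t words)
{
    return words > 0 && words < FLT_DATA_SIZE_WORDS;
}
```

With a guard like this in the caller, an oversized record would be
rejected (or truncated) up front rather than being handed to the reclaim
loop.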

> _logsys_log_rec(), so in theory everybody else would need to check
> this. And I don't want to know what the do-while-loop in
> records_reclaim() does in this overflow case when moving flt_tail from
> record to record arrives at the last one, and there's not another
> record exactly adjacent...

clearly records_reclaim needs some ironing.

> - the adjustment of words_written near the end of _logsys_log_rec()
> must also trigger when idx == index_start, i.e. when a record is
> exactly as large as the advertised flight recorder size.

do you mean this adjustment?
         if (words_written < 0) {
                 words_written += flt_data_size;
         }
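
If that is the adjustment in question, the edge case can be shown in
isolation. The sketch below is only an illustration: the two helper
functions are hypothetical, the indices are plain ints counted in words,
and the "<=" variant assumes that every record writes at least one word.

```c
/*
 * Words written between index_start and idx in a ring of `size` words,
 * using the "< 0" test. When a record fills the ring exactly, idx wraps
 * back to index_start and the result is 0.
 */
static int words_written_lt(int index_start, int idx, int size)
{
    int words_written = idx - index_start;

    if (words_written < 0) {
        words_written += size;
    }
    return words_written;
}

/*
 * Same calculation with "<= 0". A record that fills the ring exactly is
 * now reported as `size` words. Assumption: zero-length records never
 * occur, otherwise they would be misreported as a full ring.
 */
static int words_written_le(int index_start, int idx, int size)
{
    int words_written = idx - index_start;

    if (words_written <= 0) {
        words_written += size;
    }
    return words_written;
}
```

Both variants agree for ordinary and wrapped records; they differ only
when idx == index_start.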

>
> The 100% CPU problem is with version 1.2.0 in Ubuntu 10.04, version
> 1.2.2 produced core dumps, and the above observations were in
> particular wrt. the source of version 1.2.3.
>
> I'm aware that this isn't a very comprehensive bug report, I wanted to
> give an overview of the problems we are having with corosync.
>
> Also, we want to stick to the official Ubuntu packages, so presumably
> if we narrow down the issue we should report all bugs regarding
> corosync 1.2.0 in Ubuntu to the Ubuntu guys, or do you want to hear
> about them here, too?
>
> Regards, Colin
>

I would encourage you to file distro bugs so that Ubuntu updates
to the latest maintenance releases.

If you happen to reproduce issues on the latest maintenance release of 
corosync (that is 1.2.3 atm) we certainly want to hear about them.

Reporting problems here is also worthwhile, since we can help you identify
whether the problem is already fixed.

>
> PS: In general the code often looks more complicated than necessary,
> also on all levels, e.g. malloc() and memcpy() instead of a single
> strdup(), manual bit fiddling instead of bitfields, repeated
> calculations instead of factoring them out, the flight recorder with
> its weird wrapping-safeguard because there are no factored-out
> "fltcpy()" functions, ...
>

Yes, as always, software can use improvement.

> PPS: Looking further through the code we noticed that several
> signal-handlers are not safe, for example the function logsys_atexit()
> in logsys.c is called from sigsegv_handler() and sigabrt_handler() in
> main.cc, and it
> - calls sem_wait() and free(), and probably other functions, that are
> not async-signal-safe,

agree

> - accesses global variable wthread_active that is unsafe, an int
> rather than a "volatile sig_atomic_t".
>
> AFAICS these signal handlers can probably crash, or worse: deadlock
> the corosync process...

agree that needs some work

Regards
-steve

> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
