Re: [Openais] Corosync enters endless loop after hiccup in system

Colin Wed, 26 May 2010 02:46:53 -0700

On Tue, Mar 30, 2010 at 1:00 PM, Dejan Muhamedagic <[email protected]> wrote:
> On Tue, Mar 30, 2010 at 11:43:22AM +0200, Colin wrote:
>> we are running Corosync 1.2.0-0ubuntu1 on Ubuntu 10.04 beta w/current
>> updates; the cluster consists of two systems running in KVM, each on a
>> dedicated host.
>>
>> We have observed several times, but are unfortunately unable to nail
>> the exact cause, that when the virtualised system that is running
>> corosync has a "hiccup", i.e. hangs for a couple of seconds when we
>> introduce a delay into its storage access, then the corosync process
>> enters an endless loop from which it doesn't ever seem to recover.
>>
>> In this endless loop the process uses 193% CPU in the 2-CPU
>> virtualised system, and is issuing a stream of wait4() system-calls
>> (with an occasional nanosleep() and some futex-stuff).
>
> It'd be good to kill -ABRT the process and then get the
> backtrace with gdb. If you're running pacemaker, there's
> hb_report to collect all relevant information (incl the
> backtraces). Make sure that coredumps are allowed and install the
> packages which contain the debugging information.


A few weeks later: we have not managed to reproduce that particular
problem reliably, but we have either found it again or found one of
its siblings. The corosync process goes into an endless loop and
consumes exactly 100% of one CPU.

We are still gathering evidence, but it's definitely something in
logsys.c, and probably something with the flight recorder.

Quite frankly, the code in logsys.c does not inspire much confidence.
There is a lot of complicated fiddling that is not properly
encapsulated in a single place, the code strongly violates the DRY and
KISS principles, and parts of it look broken. For example:
- records_reclaim() does not check whether the flight recorder is
larger than "words", and neither does its only caller
_logsys_log_rec(), so in theory every caller further up would have to
check this. And I don't want to know what the do-while loop in
records_reclaim() does in this overflow case, when moving flt_tail
from record to record arrives at the last record and there is no
further record exactly adjacent to it...
- the adjustment of words_written near the end of _logsys_log_rec()
must also trigger when idx == index_start, i.e. when a record is
exactly as large as the advertised flight recorder size.
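
To illustrate both points, here is a minimal sketch of a word-based
ring of length-prefixed records. It is NOT the logsys.c code: the
names flt_ring, flt_reclaim(), flt_append() and FLT_WORDS are made up
for this mail, and the layout (one length word followed by the
payload) is an assumption. The point is only that the reclaim step
refuses requests that can never fit, and that a record exactly as
large as the buffer is handled like any other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLT_WORDS 16u   /* hypothetical capacity in 32-bit words */

struct flt_ring {
    uint32_t buf[FLT_WORDS];
    size_t head;   /* index of the next word to write */
    size_t tail;   /* index of the first word of the oldest record */
    size_t used;   /* number of words currently occupied */
};

/* Free at least `words` words by dropping whole records from the tail.
 * Returns 0 on success, -1 if the request can never fit. */
int flt_reclaim(struct flt_ring *r, size_t words)
{
    if (words == 0 || words > FLT_WORDS)
        return -1;   /* size guard: never loop on an impossible request */

    while (FLT_WORDS - r->used < words) {
        size_t len = r->buf[r->tail];   /* record length incl. header word */
        /* A corrupt length would otherwise make this loop spin forever. */
        assert(len > 0 && len <= r->used);
        r->tail = (r->tail + len) % FLT_WORDS;
        r->used -= len;
    }
    return 0;
}

/* Append one record: a length word followed by `n` payload words. */
int flt_append(struct flt_ring *r, const uint32_t *payload, size_t n)
{
    size_t total = n + 1;   /* header word + payload */

    if (flt_reclaim(r, total) != 0)
        return -1;

    r->buf[r->head] = (uint32_t)total;
    for (size_t i = 0; i < n; i++)
        r->buf[(r->head + 1 + i) % FLT_WORDS] = payload[i];

    r->head = (r->head + total) % FLT_WORDS;
    r->used += total;
    return 0;
}
```

Because the ring keeps an explicit word count, a record of exactly
FLT_WORDS words needs no special case, and an oversized record is
rejected before the reclaim loop is ever entered.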

The 100% CPU problem occurs with version 1.2.0 as shipped in Ubuntu
10.04; version 1.2.2 produced core dumps; and the observations above
refer in particular to the source of version 1.2.3.

I'm aware that this isn't a very comprehensive bug report; I mainly
wanted to give an overview of the problems we are having with corosync.

Also, we want to stick to the official Ubuntu packages, so presumably
if we narrow down the issue we should report all bugs regarding
corosync 1.2.0 in Ubuntu to the Ubuntu guys, or do you want to hear
about them here, too?

Regards, Colin


PS: In general the code often looks more complicated than necessary,
at all levels, e.g. malloc() and memcpy() instead of a single
strdup(), manual bit fiddling instead of bitfields, repeated
calculations instead of factoring them out, and the flight recorder
with its weird wrapping safeguard, which exists because there are no
factored-out "fltcpy()" functions, ...
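
What I mean by a factored-out "fltcpy()" is roughly the following
sketch. The name fltcpy_in() and its signature are invented here and
do not exist in corosync; it only shows that the wrap-around can live
in one helper instead of being repeated at every copy site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `n` words from `src` into the ring `dst` of `cap` words,
 * starting at index `at` and wrapping at the end of the ring.
 * Returns the index following the last word written. */
size_t fltcpy_in(uint32_t *dst, size_t cap, size_t at,
                 const uint32_t *src, size_t n)
{
    assert(at < cap && n <= cap);   /* caller must have reclaimed space */

    size_t first = cap - at;        /* words that fit before the wrap */
    if (first > n)
        first = n;

    memcpy(dst + at, src, first * sizeof *dst);
    memcpy(dst, src + first, (n - first) * sizeof *dst);
    return (at + n) % cap;
}
```

With such a helper no caller needs its own safeguard against running
off the end of the buffer.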

PPS: Looking further through the code we noticed that several signal
handlers are not safe. For example, the function logsys_atexit() in
logsys.c is called from sigsegv_handler() and sigabrt_handler() in
main.c, and it
- calls sem_wait() and free(), and probably other functions, that are
not async-signal-safe,
- accesses the global variable wthread_active in an unsafe way: it is
a plain int rather than a "volatile sig_atomic_t".

AFAICS these signal handlers can probably crash, or worse: deadlock
the corosync process...
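
For comparison, a handler that stays within the rules could look
roughly like this. It is a sketch, not a patch: got_signal,
note_signal() and die_on_signal() are names invented for this mail,
and what corosync actually needs to flush on a crash is not covered.
The handler only sets a "volatile sig_atomic_t" flag and calls
write(), which is async-signal-safe; anything involving sem_wait() or
free() would have to happen outside the handler.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical flag; volatile sig_atomic_t is the one object type a
 * handler may portably write to. */
volatile sig_atomic_t got_signal = 0;

/* Restricted to async-signal-safe work: write(2) and a flag store. */
void note_signal(int signo)
{
    static const char msg[] = "caught signal\n";
    int saved_errno = errno;   /* write() may clobber errno */

    ssize_t rc = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)rc;                  /* nothing safe to do if the write fails */

    got_signal = signo;
    errno = saved_errno;
}

/* Variant for fatal signals such as SIGSEGV or SIGABRT: log, then
 * restore the default action and re-raise so that the core dump is
 * still produced. signal() and raise() are both on the POSIX list of
 * async-signal-safe functions. */
void die_on_signal(int signo)
{
    note_signal(signo);
    signal(signo, SIG_DFL);
    raise(signo);
}
```

The normal code path can then poll got_signal and do the cleanup that
needs locks or the heap from there.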