On Tue, Mar 30, 2010 at 1:00 PM, Dejan Muhamedagic <[email protected]> wrote:
> On Tue, Mar 30, 2010 at 11:43:22AM +0200, Colin wrote:
>> we are running Corosync 1.2.0-0ubuntu1 on Ubuntu 10.4 beta w/current
>> updates; the cluster consists of two systems running in KVM, each on a
>> dedicated host.
>>
>> We have observed several times, but are unfortunately unable to nail
>> the exact cause, that when the virtualised system that is running
>> corosync has a "hiccup", i.e. hangs for couple of seconds when we
>> introduce a delay into its storage access, then the corosync process
>> enters an endless loop from which it doesn't ever seem to recover.
>>
>> In this endless loop the process uses 193% CPU in the 2-CPU
>> virtualised system, and is issuing a stream of wait4() system-calls
>> (with an occasional nanosleep() and some futex-stuff).
>
> It'd be good to kill -ABRT the process and then get the
> backtrace with gdb. If you're running pacemaker, there's
> hb_report to collect all relevant information (incl the
> backtraces). Make sure that coredumps are allowed and install the
> packages which contain the debugging information.

A few weeks later ... we didn't manage to reliably reproduce that
particular problem, but we've either found it again, or one of its
siblings: the corosync process goes into an endless loop and consumes
exactly one CPU at 100%.

We are still gathering evidence, but it's definitely something in
logsys.c, and probably something with the flight recorder.

Quite frankly, the code in logsys.c does not exactly inspire confidence.
On the contrary, there is a lot of complicated fiddling that is not
properly encapsulated in a single place; the code strongly violates the
DRY and KISS principles, and looks a bit broken, for example

- records_reclaim() does not check whether the flight recorder is larger
  than "words", and neither does its only caller _logsys_log_rec(), so
  in theory everybody else would need to check this. And I don't want to
  know what the do-while-loop in records_reclaim() does in this overflow
  case when moving flt_tail from record to record arrives at the last
  one, and there's not another record exactly adjacent...

- the adjustment of words_written near the end of _logsys_log_rec() must
  also trigger when idx == index_start, i.e. when a record is exactly as
  large as the advertised flight recorder size.

The 100% CPU problem occurs with version 1.2.0 in Ubuntu 10.04; version
1.2.2 produced core dumps; and the observations above refer in
particular to the source of version 1.2.3.

I'm aware that this isn't a very comprehensive bug report; I just wanted
to give an overview of the problems we are having with corosync. Also,
we want to stick to the official Ubuntu packages, so presumably if we
narrow down the issue we should report all bugs regarding corosync 1.2.0
in Ubuntu to the Ubuntu guys, or do you want to hear about them here,
too?

Regards, Colin

PS: In general the code often looks more complicated than necessary, on
all levels, e.g.
malloc() and memcpy() instead of a single strdup(), manual bit fiddling
instead of bitfields, repeated calculations instead of factoring them
out, the flight recorder with its weird wrapping-safeguard because there
are no factored-out "fltcpy()" functions, ...

PPS: Looking further through the code we noticed that several signal
handlers are not safe. For example, the function logsys_atexit() in
logsys.c is called from sigsegv_handler() and sigabrt_handler() in
main.cc, and it

- calls sem_wait() and free(), and probably other functions, that are
  not async-signal-safe,

- accesses the global variable wthread_active in an unsafe way: it is an
  int rather than a "volatile sig_atomic_t".

AFAICS these signal handlers can probably crash, or worse: deadlock the
corosync process...
