Re: [Openais] Corosync enters endless loop after hiccup in system

Colin Thu, 27 May 2010 02:58:46 -0700

On Wed, May 26, 2010 at 7:19 PM, Steven Dake <[email protected]> wrote:
> On 05/26/2010 02:42 AM, Colin wrote:
>>
>> Quite frankly, the code in logsys.c does not exactly build much
>> confidence, to the contrary, there is a lot of complicated fiddling
>> that is not properly encapsulated in a single place; the code strongly
>> violates the DRY and KISS principles, and looks a bit broken, for
>> example
>> - records_reclaim() does not check whether the flight recorder is
>> larger than "words", and neither does its only caller
>
> If you mean it doesn't check the 2nd parameter (words) is larger then the
> total flight record, potentially causing a crash, yes this is a weakness in
> error checking.


Yes, that's what I meant.

>> _logsys_log_rec(), so in theory everybody else would need to check
>> this. And I don't want to know what the do-while-loop in
>> records_reclaim() does in this overflow case when moving flt_tail from
>> record to record arrives at the last one, and there's not another
>> record exactly adjacent...
>
> clearly records_reclaim needs some ironing.

...

>> - the adjustment of words_written near the end of _logsys_log_rec()
>> must also trigger when idx == index_start, i.e. when a record is
>> exactly as large as the advertised flight recorder size.
>
> do you mean this adjustment?
>        if (words_written < 0) {
>                words_written += flt_data_size;
>        }

Yes.

> I would encourage you to file distro bugs to encourage ubuntu to update to
> latest maintenance releases.

Ok, hope they will do such updates in an LTS version...

> If you happen to reproduce issues on the latest maintenance release of
> corosync (that is 1.2.3 atm) we certainly want to hear about them.
>
> Reporting problems here is a good place - since we can help you identify if
> the problem is already fixed.

Ok.

>> PPS: Looking further through the code we noticed that several
>> signal-handlers are not safe, for example the function logsys_atexit()
>> in logsys.c is called from sigsegv_handler() and sigabrt_handler() in
>> main.cc, and it
>> - calls sem_wait() and free(), and probably other functions, that are
>> not async-signal-safe,
>
> agree
>
>> - accesses global variable wthread_active that is unsafe, an int
>> rather than a "volatile sig_atomic_t".
>>
>> AFAICS these signal handlers can probably crash, or worse: deadlock
>> the corosync process...
>
> agree that needs some work

One typical approach for handling signals, pain that they are, in
multi-threaded applications is to (a) block all signals in all
threads, except (b) there's one thread dedicated to signal handling
that can for example hang around in one of the wait(2)-family of
functions. This will of course only handle asynchronous signals; for
synchronous signals like SIGSEGV you'll need a global signal handler
-- or not, once there's a sigsegv I wouldn't trust my application do
perform any complex tasks like asynchronous logging, after all,
something must be very broken...

Regards, Colin
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Corosync enters endless loop after hiccup in system

Reply via email to