Hi,
I personally don't believe that the problem is caused by syslog being turned
on or off. It looks like yet another race combined with a deadlock, this
time caused by the combination of the schedwrk serialize lock and the logsys
serialize lock.
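
Assume the deadlock is a classic lock-order inversion between two serialize locks; the message above does not spell out the exact call paths, so that is only an assumption. The usual cure is then to make every code path take the locks in one fixed order. The sketch below is generic and is not the attached patch. It uses hypothetical `schedwrk_lock` and `logsys_lock` mutexes, which are stand-ins and not corosync's actual variables:

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two serialize locks; NOT corosync's
 * real variables, just an illustration of lock ordering. */
static pthread_mutex_t schedwrk_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t logsys_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

/*
 * Deadlock-prone pattern (deliberately not implemented here): thread A
 * takes schedwrk_lock then logsys_lock, while thread B takes logsys_lock
 * then schedwrk_lock. If each grabs its first lock before the other
 * releases, both block forever.
 */

/* Fix sketch: every path acquires the locks in one fixed global order. */
static void lock_both(void)
{
	pthread_mutex_lock(&schedwrk_lock);
	pthread_mutex_lock(&logsys_lock);
}

static void unlock_both(void)
{
	/* Release in the reverse order of acquisition. */
	pthread_mutex_unlock(&logsys_lock);
	pthread_mutex_unlock(&schedwrk_lock);
}

static void *worker(void *arg)
{
	int iterations = *(int *)arg;

	for (int i = 0; i < iterations; i++) {
		lock_both();
		shared_counter++;
		unlock_both();
	}
	return NULL;
}

/* Runs two workers concurrently and returns the final counter value. */
static long run_workers(int iterations)
{
	pthread_t a, b;

	shared_counter = 0;
	pthread_create(&a, NULL, worker, &iterations);
	pthread_create(&b, NULL, worker, &iterations);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return shared_counter;
}
```

`run_workers(10000)` completes and returns 20000 because both threads agree on the lock order. If one path instead took `logsys_lock` first, the two threads could each end up holding one lock while waiting forever for the other.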

Can you please try the attached patch (apply it to the current trunk)?

Regards,
  Honza

hj lee wrote:
> I noticed that this happens when corosync starts before syslog in the init
> start order. I understand that corosync requires syslog, but it should at
> least start and be fully operational even without syslog.
> 
> On Mon, Feb 8, 2010 at 10:59 AM, hj lee <[email protected]> wrote:
> 
>> Hi,
>>
>> When corosync starts to fail for some reason, multiple corosync processes
>> are created, but this does not help at all: all of these processes are
>> stuck as well. How, and by whom, is corosync started multiple times? If
>> something does start it again, it should kill the existing corosync
>> process before starting a new one.
>>
>> Thanks
>> hj
>>
>> [r...@silverthorne4 tmp]# ps -ax | grep coro
>> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
>>  3050 ?        Ssl    0:00 /usr/sbin/corosync
>>  3084 ?        S      0:00 /usr/sbin/corosync
>>  3085 ?        S      0:00 /usr/sbin/corosync
>>  3087 ?        S      0:00 /usr/sbin/corosync
>>  3088 ?        S      0:00 /usr/sbin/corosync
>>  3089 ?        S      0:00 /usr/sbin/corosync
>>  4571 pts/0    S+     0:00 grep coro
>>
>> Stack trace of 3050:
>> (gdb) bt
>> #0  0x00c5d402 in __kernel_vsyscall ()
>> #1  0x0055d563 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>)
>>     at ../sysdeps/unix/sysv/linux/poll.c:87
>> #2  0x00703c36 in poll_run (handle=<value optimized out>) at coropoll.c:377
>> #3  0x0804c0fc in main (argc=Cannot access memory at address 0x6
>> ) at main.c:1082
>> (gdb) info thread
>>   3 Thread 0xb7e54b90 (LWP 3051)  0x00c5d402 in __kernel_vsyscall ()
>>   2 Thread 0xb73f0b90 (LWP 3083)  0x00c5d402 in __kernel_vsyscall ()
>> * 1 Thread 0xb7f646c0 (LWP 3050)  0x00c5d402 in __kernel_vsyscall ()
>> (gdb) thread 2
>> [Switching to thread 2 (Thread 0xb73f0b90 (LWP 3083))]#0  0x00c5d402 in __kernel_vsyscall ()
>> (gdb) bt
>> #0  0x00c5d402 in __kernel_vsyscall ()
>> #1  0x00612df6 in nanosleep () from /lib/i686/nosegneg/libpthread.so.0
>> #2  0x007ffbc8 in ?? () from /usr/libexec/lcrso/pacemaker.lcrso
>> #3  0xb73f03a4 in ?? ()
>> #4  0x00000000 in ?? ()
>> (gdb) thread 3
>> [Switching to thread 3 (Thread 0xb7e54b90 (LWP 3051))]#0  0x00c5d402 in __kernel_vsyscall ()
>> (gdb) bt
>> #0  0x00c5d402 in __kernel_vsyscall ()
>> #1  0x0060f615 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i686/nosegneg/libpthread.so.0
>> #2  0x006fb35c in logsys_worker_thread (data=Could not find the frame base for "logsys_worker_thread".
>> ) at logsys.c:716
>> #3  0x0060b4d2 in start_thread (arg=<value optimized out>) at pthread_create.c:297
>> #4  0x0056748e in clone () from /lib/i686/nosegneg/libc.so.6
>>
>> Stack trace of the rest of them:
>> (gdb) bt
>> #0  0x00c5d402 in __kernel_vsyscall ()
>> #1  0x0060c607 in pthread_join (threadid=<value optimized out>, thread_return=<value optimized out>) at pthread_join.c:89
>> #2  0x006fa815 in logsys_atexit () at logsys.c:1704
>> #3  0x0804ca2b in sigsegv_handler (num=11) at main.c:196
>> #4  <signal handler called>
>> #5  0x005265bb in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
>> #6  0x00614cf4 in __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
>> #7  0x007fc49b in spawn_child () from /usr/libexec/lcrso/pacemaker.lcrso
>> #8  0x00800566 in pcmk_startup () from /usr/libexec/lcrso/pacemaker.lcrso
>> #9  0x0804e670 in corosync_service_link_and_init (corosync_api=0x805a480, service_name=0x9455088 "pacemaker", service_ver=0)
>>     at service.c:179
>> #10 0x0804e8dc in corosync_service_defaults_link_and_init (corosync_api=0x805a480) at service.c:441
>> #11 0x0804c8d5 in main_service_ready () at main.c:753
>> #12 0x00711834 in main_iface_change_fn (context=<value optimized out>, iface_addr=<value optimized out>, iface_no=<value optimized out>)
>>     at totemsrp.c:4253
>> #13 0x00709f42 in rrp_iface_change_fn (context=Could not find the frame base for "rrp_iface_change_fn".
>> ) at totemrrp.c:1422
>> #14 0x007074c4 in timer_function_netif_check_timeout (data=<value optimized out>) at totemudp.c:1359
>> #15 0x00703e01 in poll_run (handle=<value optimized out>) at tlist.h:309
>> #16 0x0804c0fc in main (argc=6122908, argv=0x4a9c02) at main.c:1082
>>
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais

diff --git a/trunk/exec/main.c b/trunk/exec/main.c
index beedc2c..a9aa094 100644
--- a/trunk/exec/main.c
+++ b/trunk/exec/main.c
@@ -156,9 +156,6 @@ static void unlink_all_completed (void)
 {
        poll_stop (corosync_poll_handle);
        coroipcs_ipc_exit ();
-       totempg_finalize ();
-
-       corosync_exit_error (AIS_DONE_EXIT);
 }
 
 void corosync_shutdown_request (void)
@@ -1538,6 +1535,8 @@ int main (int argc, char **argv)
         */
        poll_run (corosync_poll_handle);
 
+       totempg_finalize ();
+
        return EXIT_SUCCESS;
 }
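
The patch stops `unlink_all_completed()` from calling `totempg_finalize()` and `corosync_exit_error()`. Because that function calls `poll_stop()`, it presumably runs while the poll loop is still dispatching. With the patch, teardown happens in `main()` only after `poll_run()` has returned. The minimal, self-contained sketch below shows that "stop the loop from the callback, finalize after it returns" pattern. The loop, the counters and every function name in it are hypothetical and only mimic the shape of the change:

```c
#include <stdbool.h>

/* Hypothetical miniature event loop; names are illustrative only. */
struct loop {
	bool stop_requested;
	int iterations;
};

static int finalize_calls;
static bool finalized_inside_loop;
static bool in_loop;

/* Stand-in for poll_stop(): only flags the loop to exit. */
static void loop_stop(struct loop *l)
{
	l->stop_requested = true;
}

/* Stand-in for totempg_finalize(); records where it was called from. */
static void subsystem_finalize(void)
{
	if (in_loop)
		finalized_inside_loop = true;
	finalize_calls++;
}

/* Shutdown callback: after the change it only asks the loop to stop and
 * no longer tears anything down or exits from inside the dispatch. */
static void shutdown_completed(struct loop *l)
{
	loop_stop(l);
}

/* Stand-in for poll_run(); triggers shutdown on iteration shutdown_at
 * (shutdown_at must be >= 1, otherwise the loop never stops). */
static void loop_run(struct loop *l, int shutdown_at)
{
	in_loop = true;
	while (!l->stop_requested) {
		l->iterations++;
		if (l->iterations == shutdown_at)
			shutdown_completed(l);
	}
	in_loop = false;
}

/* Mirrors main(): run the loop, then finalize once it has returned. */
static int run_service(int shutdown_at)
{
	struct loop l = { false, 0 };

	loop_run(&l, shutdown_at);
	subsystem_finalize();
	return l.iterations;
}
```

Calling `run_service(5)` returns 5 and finalizes exactly once, outside the loop. Deferring teardown this way means no subsystem is finalized while one of its own callbacks may still be on the stack.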
 