On Thu, Jun 24, 2010 at 9:16 AM, Steven Dake <[email protected]> wrote:
> On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
>>
>>> On Thu, Jun 24, 2010 at 1:50 AM, dan clark <[email protected]> wrote:
>>>
>>> Dear Gentle Reader....
>>>
>>> Attached is a small test program to stress initializing and finalizing
>>> communication between a corosync cpg client and the corosync daemon.
>>> The test was run under version 1.2.4. Initial testing was done on a
>>> single node; subsequent testing occurred on a 3-node system.
>>>
>>> 1) If the program is run so that it loops over the
>>> initialize/mcast_joined/dispatch/finalize sequence AND the corosync
>>> daemon is restarted while the program is looping (service corosync
>>> restart), then the application locks up inside the corosync client
>>> library in a variety of interesting locations. This is easiest to
>>> reproduce on a single-node system with a large iteration count and a
>>> usleep value between joins, for example 'stress_finalize -t 500 -i
>>> 10000 -u 1000 -v'. Sometimes it recovers in a few seconds (analysis of
>>> strace indicated futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which
>>> would account for multiple 2-second delays in error recovery from a
>>> lost corosync daemon). Sometimes it locks up solid! What is the proper
>>> way of handling the loss of the corosync daemon? Is it possible for
>>> the cpg library to recover quickly when the daemon fails?
>>>
>>> sample back trace of lockup:
>>> #0 0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>>> #1 0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>> handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>> res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>>> #2 0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>> group=<value optimized out>) at cpg.c:458
>>> #3 0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>> groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>> at stress_finalize.c:101
>>> #4 0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>> at stress_finalize.c:243
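For reference, since the attachment is not reproduced here, the loop being exercised presumably looks something like the sketch below. It is only an approximation of stress_finalize.c (the group name, iteration count and delay are invented, option handling is omitted), but it uses the same cpg calls named above and builds with gcc -lcpg:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* No-op callbacks; the stress test only cares about connection setup
 * and teardown, not message content. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
        struct cpg_name group;
        struct iovec iov;
        char payload[] = "ping";
        int i;

        strcpy(group.value, "stress_group");    /* illustrative name */
        group.length = strlen(group.value);

        for (i = 0; i < 10000; i++) {
                cpg_handle_t handle;

                if (cpg_initialize(&handle, &callbacks) != CS_OK) {
                        fprintf(stderr, "cpg_initialize failed\n");
                        continue;               /* daemon may be restarting */
                }
                if (cpg_join(handle, &group) == CS_OK) {
                        iov.iov_base = payload;
                        iov.iov_len = sizeof(payload);
                        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
                        cpg_dispatch(handle, CS_DISPATCH_ALL);
                        cpg_leave(handle, &group);
                }
                cpg_finalize(handle);
                usleep(1000);
        }
        return 0;
}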
>>
>> I've also started getting semaphore-related stack traces.
>>
>
> The stack trace from Dan is different from yours, Andrew. Yours is during
> startup. Dan is more concerned about the fact that sem_timedwait sits
> around for 2 seconds before returning an indication that the server has
> exited or stopped (along with other issues).
>
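On Dan's question about the proper way of handling daemon loss: I don't think there is a single blessed answer, but one pattern that keeps the application out of a blocking library call is sketched below. It is purely illustrative (the helper name and timeout are made up): poll the descriptor returned by cpg_fd_get() yourself, and treat POLLHUP/POLLERR, or a CS_ERR_LIBRARY/CS_ERR_BAD_HANDLE return from cpg_dispatch(), as "daemon gone", then finalize, re-initialize and re-join.

#include <poll.h>
#include <corosync/cpg.h>

/* Hypothetical helper: returns 1 if the connection was rebuilt,
 * 0 if nothing special happened, -1 on unrecoverable error.
 * The caller must re-join its groups after a rebuild. */
int wait_and_dispatch(cpg_handle_t *handle, cpg_callbacks_t *callbacks)
{
        struct pollfd pfd;
        cs_error_t err;
        int fd;

        if (cpg_fd_get(*handle, &fd) != CS_OK)
                return -1;

        pfd.fd = fd;
        pfd.events = POLLIN;

        if (poll(&pfd, 1, 1000) <= 0)
                return 0;                       /* timeout or signal */

        if (pfd.revents & (POLLHUP | POLLERR)) {
                /* The daemon side of the IPC connection went away. */
                cpg_finalize(*handle);
                return cpg_initialize(handle, callbacks) == CS_OK ? 1 : -1;
        }

        err = cpg_dispatch(*handle, CS_DISPATCH_ALL);
        if (err == CS_ERR_LIBRARY || err == CS_ERR_BAD_HANDLE) {
                cpg_finalize(*handle);
                return cpg_initialize(handle, callbacks) == CS_OK ? 1 : -1;
        }
        return 0;
}

Whether the teardown calls (cpg_leave/cpg_finalize) can themselves still hang when the daemon dies mid-request is exactly what Dan's backtrace shows, so this only helps with detection, not with the hang itself.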
>> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> 45 isem->value = value;
>> (gdb) where
>> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> #1 0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
>> optimized out>, service=<value optimized out>, request_size=1048576,
>> response_size=1048576, dispatch_size=1048576, handle=<value optimized
>> out>)
>> at coroipcc.c:706
>> #2 0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
>> #3 0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
>> #4 0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
>> at cluster.c:56
>> #5 0x000000000040e9fb in cib_init () at main.c:424
>> #6 0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
>> (gdb) print *isem
>> Cannot access memory at address 0x7ff01f81a008
>>
>> sigh
>>
>
> This code literally hasn't been modified for over a year - strange to start
> seeing errors now.
>
> Is your /dev/shm full?
Looks like it.
Also probably explains:
Program terminated with signal 7, Bus error.
#0 memset () at ../sysdeps/x86_64/memset.S:1050
1050 movntdq %xmm0,(%rdi)
(gdb) where
#0 memset () at ../sysdeps/x86_64/memset.S:1050
#1 0x00007f08d98453ad in circular_memory_map (bytes=<value optimized
out>, buf=0x208388) at /usr/include/bits/string3.h:86
#2 _logsys_rec_init (bytes=<value optimized out>, buf=0x208388) at
logsys.c:1037
#3 0x00000000004069b5 in logsys_system_init () at main.c:89
#4 0x000000000040f776 in __do_global_ctors_aux ()
#5 0x00000000004038fb in _init ()
#6 0x0000000000000000 in ?? ()
Looks like you've got some leaks to fix up.
And possibly some error handling to write.
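To make "error handling to write" concrete, here is a rough sketch of the kind of defensive setup being suggested. This is hypothetical code, not coroipcc's actual implementation: the point is simply that the shared-memory buffers should be fully allocated and every step checked, so a full /dev/shm surfaces as a clean ENOSPC error instead of a Bus error when the mapping is first touched (as in the memset trace above), and the segment gets unlinked rather than leaked on failure (link with -lrt -lpthread):

#include <fcntl.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a pshared-semaphore-backed buffer in
 * /dev/shm, cleaning up after itself on any failure. */
static void *create_shared_buffer(const char *name, size_t bytes, sem_t **sem_out)
{
        void *addr;
        int fd;

        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
                return NULL;

        if (ftruncate(fd, bytes) < 0)
                goto fail;

        /* Force the pages to really be allocated so a full tmpfs shows up
         * as ENOSPC here rather than as SIGBUS when the buffer is later
         * memset. */
        if (posix_fallocate(fd, 0, bytes) != 0)
                goto fail;

        addr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                goto fail;

        *sem_out = (sem_t *)addr;
        if (sem_init(*sem_out, 1, 0) < 0) {     /* pshared, as in the trace */
                munmap(addr, bytes);
                goto fail;
        }

        close(fd);
        return addr;    /* caller shm_unlink()s the name at teardown */

fail:
        close(fd);
        shm_unlink(name);       /* don't leave the segment behind in /dev/shm */
        return NULL;
}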
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M     0  121M   0% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M  7.0M  114M   6% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.. [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M  3.9M  117M   4% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M   11M  110M   9% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
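Going by the 1048576-byte request/response/dispatch buffers visible in the connect trace above, each IPC connection wants a few MiB of /dev/shm, so a 121M tmpfs fills up quickly under a churn test plus whatever is left behind across restarts. A client or test harness could at least warn about the condition up front; a minimal sketch, with the 4 MiB threshold being an assumption rather than a documented figure:

#include <stdio.h>
#include <sys/statvfs.h>

/* Returns 1 if /dev/shm has at least needed_bytes free, 0 if not,
 * -1 if the filesystem could not be queried. */
static int shm_has_headroom(unsigned long long needed_bytes)
{
        struct statvfs vfs;
        unsigned long long avail;

        if (statvfs("/dev/shm", &vfs) != 0)
                return -1;

        avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
        return avail >= needed_bytes ? 1 : 0;
}

int main(void)
{
        /* Three 1 MiB buffers plus some slack for the control segment. */
        if (shm_has_headroom(4ULL * 1024 * 1024) != 1)
                fprintf(stderr, "warning: /dev/shm looks too full for "
                                "another corosync IPC connection\n");
        return 0;
}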