On Thu, Jun 24, 2010 at 9:16 AM, Steven Dake <[email protected]> wrote:
> On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
>>
>>> On Thu, Jun 24, 2010 at 1:50 AM, dan clark <[email protected]> wrote:
>>>
>>> Dear Gentle Reader....
>>>
>>> Attached is a small test program to stress initializing and finalizing
>>> communication between a corosync cpg client and the corosync daemon.
>>> The test was run under version 1.2.4. Initial testing was done on a
>>> single node; subsequent testing occurred on a 3-node system.
>>>
>>> 1) If the program is run so that it loops over the
>>> initialize/mcast_joined/dispatch/finalize sequence AND the corosync
>>> daemon is restarted while the program is looping (service corosync
>>> restart), then the application locks up inside the corosync client
>>> library in a variety of interesting locations. This is easiest to
>>> reproduce on a single-node system with a large iteration count and a
>>> usleep value between joins, for example 'stress_finalize -t 500 -i
>>> 10000 -u 1000 -v'. Sometimes it recovers in a few seconds (analysis of
>>> strace indicated futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which
>>> would account for multiple 2-second delays in error recovery from a
>>> lost corosync daemon). Sometimes it locks up solid! What is the proper
>>> way of handling the loss of the corosync daemon? Is it possible for
>>> the cpg library to recover quickly when the daemon fails?
>>>
>>> sample back trace of lockup:
>>> #0 0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>>> #1 0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>> handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>> res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>>> #2 0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>> group=<value optimized out>) at cpg.c:458
>>> #3 0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>> groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>> at stress_finalize.c:101
>>> #4 0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>> at stress_finalize.c:243
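For reference, since the attachment is not reproduced here, the loop being exercised presumably looks something like the sketch below. It is only an approximation of stress_finalize.c (the group name, iteration count and delay are invented, option handling is omitted), but it uses the same cpg calls named above and builds with gcc -lcpg:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* No-op callbacks; the stress test only cares about connection setup
 * and teardown, not message content. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
        struct cpg_name group;
        struct iovec iov;
        char payload[] = "ping";
        int i;

        strcpy(group.value, "stress_group");    /* illustrative name */
        group.length = strlen(group.value);

        for (i = 0; i < 10000; i++) {
                cpg_handle_t handle;

                if (cpg_initialize(&handle, &callbacks) != CS_OK) {
                        fprintf(stderr, "cpg_initialize failed\n");
                        continue;               /* daemon may be restarting */
                }
                if (cpg_join(handle, &group) == CS_OK) {
                        iov.iov_base = payload;
                        iov.iov_len = sizeof(payload);
                        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
                        cpg_dispatch(handle, CS_DISPATCH_ALL);
                        cpg_leave(handle, &group);
                }
                cpg_finalize(handle);
                usleep(1000);
        }
        return 0;
}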
>>
>> I've also started getting semaphore-related stack traces.
>>
>
> The stack trace from Dan is different from yours, Andrew. Yours is during
> startup. Dan is more concerned about the fact that sem_timedwait sits
> around for 2 seconds before returning an indication that the server has
> exited or stopped (along with other issues).
>
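On Dan's question about the proper way of handling daemon loss: I don't think there is a single blessed answer, but one pattern that keeps the application out of a blocking library call is sketched below. It is purely illustrative (the helper name and timeout are made up): poll the descriptor returned by cpg_fd_get() yourself, and treat POLLHUP/POLLERR, or a CS_ERR_LIBRARY/CS_ERR_BAD_HANDLE return from cpg_dispatch(), as "daemon gone", then finalize, re-initialize and re-join.

#include <poll.h>
#include <corosync/cpg.h>

/* Hypothetical helper: returns 1 if the connection was rebuilt,
 * 0 if nothing special happened, -1 on unrecoverable error.
 * The caller must re-join its groups after a rebuild. */
int wait_and_dispatch(cpg_handle_t *handle, cpg_callbacks_t *callbacks)
{
        struct pollfd pfd;
        cs_error_t err;
        int fd;

        if (cpg_fd_get(*handle, &fd) != CS_OK)
                return -1;

        pfd.fd = fd;
        pfd.events = POLLIN;

        if (poll(&pfd, 1, 1000) <= 0)
                return 0;                       /* timeout or signal */

        if (pfd.revents & (POLLHUP | POLLERR)) {
                /* The daemon side of the IPC connection went away. */
                cpg_finalize(*handle);
                return cpg_initialize(handle, callbacks) == CS_OK ? 1 : -1;
        }

        err = cpg_dispatch(*handle, CS_DISPATCH_ALL);
        if (err == CS_ERR_LIBRARY || err == CS_ERR_BAD_HANDLE) {
                cpg_finalize(*handle);
                return cpg_initialize(handle, callbacks) == CS_OK ? 1 : -1;
        }
        return 0;
}

Whether the teardown calls (cpg_leave/cpg_finalize) can themselves still hang when the daemon dies mid-request is exactly what Dan's backtrace shows, so this only helps with detection, not with the hang itself.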
>> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> 45 isem->value = value;
>> (gdb) where
>> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> #1 0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
>> optimized out>, service=<value optimized out>, request_size=1048576,
>> response_size=1048576, dispatch_size=1048576, handle=<value optimized
>> out>)
>> at coroipcc.c:706
>> #2 0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
>> #3 0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
>> #4 0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
>> at cluster.c:56
>> #5 0x000000000040e9fb in cib_init () at main.c:424
>> #6 0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
>> (gdb) print *isem
>> Cannot access memory at address 0x7ff01f81a008
>>
>> sigh
>>
>
> This code literally hasn't been modified for over a year - strange to start
> seeing errors now.
>
> Is your /dev/shm full?
Looks like it.
Also probably explains:
Program terminated with signal 7, Bus error.
#0 memset () at ../sysdeps/x86_64/memset.S:1050
1050 movntdq %xmm0,(%rdi)
(gdb) where
#0 memset () at ../sysdeps/x86_64/memset.S:1050
#1 0x00007f08d98453ad in circular_memory_map (bytes=<value optimized
out>, buf=0x208388) at /usr/include/bits/string3.h:86
#2 _logsys_rec_init (bytes=<value optimized out>, buf=0x208388) at
logsys.c:1037
#3 0x00000000004069b5 in logsys_system_init () at main.c:89
#4 0x000000000040f776 in __do_global_ctors_aux ()
#5 0x00000000004038fb in _init ()
#6 0x0000000000000000 in ?? ()
Looks like you've got some leaks to fix up.
And possibly some error handling to write.
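To make "error handling to write" concrete, here is a rough sketch of the kind of defensive setup being suggested. This is hypothetical code, not coroipcc's actual implementation: the point is simply that the shared-memory buffers should be fully allocated and every step checked, so a full /dev/shm surfaces as a clean ENOSPC error instead of a Bus error when the mapping is first touched (as in the memset trace above), and the segment gets unlinked rather than leaked on failure (link with -lrt -lpthread):

#include <fcntl.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a pshared-semaphore-backed buffer in
 * /dev/shm, cleaning up after itself on any failure. */
static void *create_shared_buffer(const char *name, size_t bytes, sem_t **sem_out)
{
        void *addr;
        int fd;

        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
                return NULL;

        if (ftruncate(fd, bytes) < 0)
                goto fail;

        /* Force the pages to really be allocated so a full tmpfs shows up
         * as ENOSPC here rather than as SIGBUS when the buffer is later
         * memset. */
        if (posix_fallocate(fd, 0, bytes) != 0)
                goto fail;

        addr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                goto fail;

        *sem_out = (sem_t *)addr;
        if (sem_init(*sem_out, 1, 0) < 0) {     /* pshared, as in the trace */
                munmap(addr, bytes);
                goto fail;
        }

        close(fd);
        return addr;    /* caller shm_unlink()s the name at teardown */

fail:
        close(fd);
        shm_unlink(name);       /* don't leave the segment behind in /dev/shm */
        return NULL;
}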
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M     0  121M   0% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M  7.0M  114M   6% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.. [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M  3.9M  117M   4% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
[r...@pcmk-2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
[r...@pcmk-2 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  4.2G  1.6G  2.4G  40% /
tmpfs                         121M   11M  110M   9% /dev/shm
/dev/vda1                     194M   64M  121M  35% /boot
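Going by the 1048576-byte request/response/dispatch buffers visible in the connect trace above, each IPC connection wants a few MiB of /dev/shm, so a 121M tmpfs fills up quickly under a churn test plus whatever is left behind across restarts. A client or test harness could at least warn about the condition up front; a minimal sketch, with the 4 MiB threshold being an assumption rather than a documented figure:

#include <stdio.h>
#include <sys/statvfs.h>

/* Returns 1 if /dev/shm has at least needed_bytes free, 0 if not,
 * -1 if the filesystem could not be queried. */
static int shm_has_headroom(unsigned long long needed_bytes)
{
        struct statvfs vfs;
        unsigned long long avail;

        if (statvfs("/dev/shm", &vfs) != 0)
                return -1;

        avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
        return avail >= needed_bytes ? 1 : 0;
}

int main(void)
{
        /* Three 1 MiB buffers plus some slack for the control segment. */
        if (shm_has_headroom(4ULL * 1024 * 1024) != 1)
                fprintf(stderr, "warning: /dev/shm looks too full for "
                                "another corosync IPC connection\n");
        return 0;
}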