On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
> On Thu, Jun 24, 2010 at 1:50 AM, dan clark <[email protected]> wrote:
>> Dear Gentle Reader....
>>
>> Attached is a small test program to stress initializing and finalizing
>> communication between a corosync cpg client and the corosync daemon.
>> The test was run under version 1.2.4. Initial testing was with a
>> single node; subsequent testing occurred on a system consisting of 3
>> nodes.
>>
>> 1) If the program is run in such a way that it loops on
>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
>> restarted while the program is looping (service corosync restart), then
>> the application locks up in the corosync client library in a variety
>> of interesting locations. This is easiest to reproduce on a single
>> node system with a large iteration count and a usleep value between
>> joins: 'stress_finalize -t 500 -i 10000 -u 1000 -v'. Sometimes it
>> recovers in a few seconds (analysis of strace showed
>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which would account for
>> multiple 2-second delays in error recovery from a lost corosync
>> daemon). Sometimes it locks up solid! What is the proper way of
>> handling the loss of the corosync daemon? Is it possible for the cpg
>> library to recover quickly in the case of a failed daemon?
>>
>> Sample backtrace of the lockup:
>> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>     handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>     res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>     group=<value optimized out>) at cpg.c:458
>> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>     groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>     at stress_finalize.c:101
>> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>     at stress_finalize.c:243
>
> I've also started getting semaphore related stack traces.
>
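To Dan's question about the proper way of handling the loss of the daemon:
roughly, every CPG call returns a cs_error_t, and the only thing the client
can do is treat anything other than CS_OK (in particular CS_ERR_LIBRARY or
CS_ERR_TRY_AGAIN) as a signal to finalize the handle and reconnect. Below is
a minimal sketch of the kind of loop the test exercises, written against the
corosync 1.x CPG API in <corosync/cpg.h>; the group name, payload and retry
policy are illustrative only, and the callback signatures should be checked
against the installed headers.

#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
        uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
        /* a real client would consume the message here */
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
        const struct cpg_address *members, size_t n_members,
        const struct cpg_address *left, size_t n_left,
        const struct cpg_address *joined, size_t n_joined)
{
        /* membership changes are reported here */
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
};

/* One pass of initialize/join/mcast/dispatch/finalize.  Returns 0 on
 * success, -1 if any call failed (e.g. the daemon went away), in which
 * case the caller should back off briefly and start over with a fresh
 * handle. */
static int one_iteration(const char *group_str)
{
        cpg_handle_t handle;
        struct cpg_name group;
        struct iovec iov;
        static char payload[] = "ping";
        cs_error_t err;

        err = cpg_initialize(&handle, &callbacks);
        if (err != CS_OK)
                return -1;              /* daemon not reachable */

        group.length = strlen(group_str);
        memcpy(group.value, group_str, group.length);

        err = cpg_join(handle, &group);
        if (err == CS_OK) {
                iov.iov_base = payload;
                iov.iov_len = sizeof(payload);
                err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        }
        if (err == CS_OK)
                err = cpg_dispatch(handle, CS_DISPATCH_ONE);

        /* Always finalize; if the daemon died mid-call the library
         * reports an error and the handle must be thrown away. */
        cpg_finalize(handle);
        return (err == CS_OK) ? 0 : -1;
}

That does not change the latency Dan is measuring, though: when the daemon
dies in the middle of a call, the blocking happens inside coroipcc (the
sem_wait/sem_timedwait in the traces above), before any error code reaches
the caller.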
The stack trace from Dan is different from yours, Andrew. Yours is during
startup. Dan is more concerned that sem_timedwait sits around for 2 seconds
before returning an indication that the server has exited or stopped (along
with other issues).

> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> 45          isem->value = value;
> Missing separate debuginfos, use: debuginfo-install
> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
> libuuid-2.16-10.2.fc12.x86_64
> (gdb) where
> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> #1  0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
>     optimized out>, service=<value optimized out>, request_size=1048576,
>     response_size=1048576, dispatch_size=1048576, handle=<value optimized out>)
>     at coroipcc.c:706
> #2  0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
>     <cib_ais_dispatch>, destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
>     our_uname=0x6182c0, nodeid=0x0) at ais.c:622
> #3  0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
>     <cib_ais_dispatch>, destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
>     our_uname=0x6182c0, nodeid=0x0) at ais.c:585
> #4  0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>     our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
>     at cluster.c:56
> #5  0x000000000040e9fb in cib_init () at main.c:424
> #6  0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
> (gdb) print *isem
> Cannot access memory at address 0x7ff01f81a008
>
> sigh
> This code literally hasn't been modified for over a year - strange to
> start seeing errors now.

Is your /dev/shm full?

Regards
-steve

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
