On Mon, Aug 30, 2010 at 4:34 PM, Guillaume Chanaud
<guillaume.chan...@connecting-nature.com> wrote:
> On 27/08/2010 16:29, Andrew Beekhof wrote:
>>
>> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
>> <guillaume.chan...@connecting-nature.com> wrote:
>>>
>>> Hello,
>>> sorry for the delay it took, July is not the best month to get things
>>> working fast.
>>
>> Neither is August :-)
>>
> lol sure :)
>>>
>>> Here is the core dump file (55MB):
>>> http://www.connecting-nature.com/corosync/core
>>> corosync version is 1.2.3
>>
>> Sorry, but I can't do anything with that file.
>> Core files are only usable on the machine they came from.
>>
>> You'll have to open it with gdb and type "bt" to get a backtrace.
>
> Sorry, I saw that after sending my last mail. In fact I tried to debug/bt it,
> but
> 1. I'm not a C developer (I understand a little about it...)
> 2. I have never used gdb before, so it's hard to step into the corosync debug
>
> I'm not sure the trace will be useful but here it is:
> Core was generated by `corosync'.
> Program terminated with signal 6, Aborted.
> #0  0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x0000003506a34185 in abort () at abort.c:92
> #2  0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae "token_memb_entries >= 1",
>     file=<value optimized out>, line=1194, function=<value optimized out>) at assert.c:81
> #3  0x00007fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at totemsrp.c:1194
> #4  0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010, memb_join=0x822bf8)
>     at totemsrp.c:3922
> #5  0x00007fce14f01a3a in message_handler_memb_join (instance=0x7fce12338010,
>     msg=<value optimized out>, msg_len=<value optimized out>,
>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
> #6  0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>, msg=0x822bf8,
>     msg_len=420) at totemrrp.c:1404
> #7  0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>,
>     fd=<value optimized out>, revents=<value optimized out>, data=0x822550) at totemudp.c:1244
> #8  0x00007fce14ef259a in poll_run (handle=2240235047305084928) at coropoll.c:435
> #9  0x0000000000405594 in main (argc=<value optimized out>, argv=<value optimized out>)
>     at main.c:1558

Ok, definitely a corosync bug.

>
> I tried to compile it from source (1.2.7 tag and svn trunk) but I'm unable
> to get a backtrace, as gdb tells me it can't find the debuginfo (I did a
> ./configure --enable-debug but gdb seems to need a
> /usr/lib/debug/.build-id/... related to the current executable, and I don't
> know how to generate this)

What about installing 1.2.7 from clusterlabs?
If you still see it with 1.2.7, you should definitely report this to
the openais mailing list.

> On the 1.2.7 version, the init script says it started correctly, but after
> one or two seconds only the lrmd and pengine processes are still alive
>
> On the trunk version, the init script fails to start (and so the processes
> are correctly killed)
>
> In 1.2.7, when I'm stepping, I'm unable to go further than
> service.c:201 res = service->exec_init_fn (corosync_api);
> as it should create a new process for the pacemaker services, I think
> (I don't know how to step inside this new process and debug it)
>
> If you need/want, I'll let you access this VM via ssh to test/debug it.
>
> It should be related to other posts about "Could not connect to the CIB
> service: connection failed" (I saw some messages describing things more or
> less like my problem)
>
> Here is the end of the messages log again:
> Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership 208656: quorum acquired
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0) ip(192.168.0.60) (
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 83929280
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0 seen=20865
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node filer2.connecting-nature.com now has id: 100706496
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is now known as filer2.connecting-nature.com
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0) ip(192.168.0.6) vo
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 1174448320
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0 born=0 seen=20
> Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is operational
> Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_st
> Aug 30 16:30:50 www01 corosync[19809]: [TOTEM ] FAILED TO RECEIVE
> Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656: quorum retained
> Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection failed
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS connection terminated
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS connection failed
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS connection failed
> Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
> Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
> Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost connection to the CIB service [19817].
> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/callback].
> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/command].
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection failed
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS connection terminated
>
> The strange thing is that crmd finds the hostname for
> filer2.connecting-nature.com (which is the DC), but sets it to <null> for all
> other cluster nodes
>
> Thanks!
> Guillaume
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker