Hi Angus,

I'll try upgrading to the latest libqb tomorrow and see if I can reproduce
this behavior with it. I was able to get a coredump by running corosync
manually in the foreground (corosync -f):
http://sources.xes-inc.com/downloads/corosync.coredump
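In case it is useful, here is roughly how I captured it (a sketch of my
procedure; the limits.conf lines are the change I mention further down in
the thread):

    # /etc/security/limits.conf: allow core dumps (applies to new sessions)
    *    soft    core    unlimited
    *    hard    core    unlimited

    # then, from a root shell:
    ulimit -c unlimited      # enable cores in this shell as well
    service pacemaker stop   # stop the stack before restarting corosync
    service corosync stop
    corosync -f              # run in the foreground until it dies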
There still isn't anything added to /var/lib/corosync, however. What do I
need to do to enable the fdata file to be created?

Thanks,

Andrew

----- Original Message -----
From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Thursday, November 1, 2012 5:11:23 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 01/11/12 14:32 -0500, Andrew Martin wrote:
>Hi Honza,
>
>Thanks for the help. I enabled core dumps in /etc/security/limits.conf but
>didn't have a chance to reboot and apply the changes, so I don't have a
>core dump this time. Do core dumps need to be enabled for the
>fdata-DATETIME-PID file to be generated? Right now all that is in
>/var/lib/corosync are the ringid_XXX files. Do I need to set something
>explicitly in the corosync config to enable this logging?
>
>I did find something else interesting with libqb this time. I compiled
>libqb 0.14.2 for use with the cluster. This time when corosync died I
>noticed the following in dmesg:
>Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide
>error ip:7f657a52e517 sp:7fffd5068858 error:0 in
>libqb.so.0.14.2[7f657a525000+1f000]
>This error was only present for one of the many times corosync has died.
>
>I see that there is a newer version of libqb (0.14.3) out, but didn't see
>a fix for this particular bug. Could this libqb problem be related to
>corosync dying? Here's the corresponding corosync log file (next time I
>should have a core dump as well):
>http://pastebin.com/5FLKg7We

Hi Andrew

I can't see much wrong with the log either. If you could run with the
latest (libqb-0.14.3) and post a backtrace if it still happens, that would
be great.
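Honza's instructions below have the details; it is roughly (a sketch,
assuming the core file lands in /var/lib/corosync as core.PID):

    ls /var/lib/corosync/core.*                  # locate the coredump
    gdb corosync /var/lib/corosync/core.<PID>    # <PID>: the dead corosync's PID
    (gdb) thread apply all bt                    # backtrace of all threads
    (gdb) quit

    # please also xz the flight-data file before uploading, e.g.:
    xz /var/lib/corosync/fdata-*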
Thanks
Angus

>Thanks,
>
>Andrew
>
>----- Original Message -----
>
>From: "Jan Friesse" <jfrie...@redhat.com>
>To: "Andrew Martin" <amar...@xes-inc.com>
>Cc: disc...@corosync.org, "The Pacemaker cluster resource manager"
><pacemaker@oss.clusterlabs.org>
>Sent: Thursday, November 1, 2012 7:55:52 AM
>Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>
>Andrew,
>I was not able to find anything interesting (from the corosync point of
>view) in the configuration/logs.
>
>What would be helpful:
>- If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID
>file from the dead corosync. Can you please xz them and store them
>somewhere? (They are quite large but compress well.)
>- If you are able to reproduce the problem (it seems you are), can you
>please allow generation of coredumps and store a backtrace of the coredump
>somewhere? (Coredumps are stored in /var/lib/corosync as core.PID; to get
>a backtrace, run gdb corosync /var/lib/corosync/core.PID and there
>"thread apply all bt".) If you are running a distribution with ABRT
>support, you can also use ABRT to generate a report.
>
>Regards,
>Honza
>
>Andrew Martin napsal(a):
>> Corosync died an additional 3 times during the night on storage1. I
>> wrote a daemon to attempt to restart it as soon as it fails, so only one
>> of those times resulted in a STONITH of storage1.
>>
>> I enabled debug in the corosync config, so I was able to capture a
>> period when corosync died with debug output:
>> http://pastebin.com/eAmJSmsQ
>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02.
>> For reference, here is my Pacemaker configuration:
>> http://pastebin.com/DFL3hNvz
>>
>> It seems that an extra node, 16777343 "localhost", was added to the
>> cluster after storage1 was STONITHed (it must be the localhost interface
>> on storage1). Is there any way to prevent this?
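(Side note on that node ID, in case it helps: when no nodeid is configured,
corosync derives one from the ring0 IPv4 address, so 16777343 is just
127.0.0.1 read as a little-endian 32-bit integer, which fits your localhost
guess. A quick shell check:)

    id=16777343
    printf '%d.%d.%d.%d\n' $((id & 255)) $((id >> 8 & 255)) \
        $((id >> 16 & 255)) $((id >> 24 & 255))
    # prints 127.0.0.1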
>>
>> Does this help to determine why corosync is dying, and what I can do to
>> fix it?
>>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Andrew Martin" <amar...@xes-inc.com>
>> To: disc...@corosync.org
>> Sent: Thursday, November 1, 2012 12:11:35 AM
>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>
>> Hello,
>>
>> I recently configured a 3-node fileserver cluster by building Corosync
>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes run Ubuntu 12.04
>> amd64. Two of the nodes (storage0 and storage1) are "real" nodes where
>> the resources run (a DRBD disk, filesystem mount, and samba/nfs
>> daemons), while the third node (storagequorum) is in standby mode and
>> acts as a quorum node for the cluster. Today I discovered that corosync
>> died on both storage0 and storage1 at the same time. Since corosync
>> died, pacemaker shut down as well on both nodes. Because the cluster no
>> longer had quorum (and no-quorum-policy="freeze" is set), storagequorum
>> was unable to STONITH either node and left the resources frozen where
>> they were running, on storage0. I cannot find any log information to
>> determine why corosync crashed, and this is a disturbing problem, as the
>> cluster and its messaging layer must be stable. Below are my corosync
>> configuration file and the corosync log file from each node during this
>> period.
>>
>> corosync.conf:
>> http://pastebin.com/vWQDVmg8
>> Note that I have two redundant rings. On one of them, I specify the IP
>> address (in this example 10.10.10.7) so that it binds to the correct
>> interface (since those machines may in the future have two interfaces on
>> the same subnet).
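>> For reference, the two-ring totem setup is along these lines (a
>> simplified sketch, not the exact file; the real values are in the
>> pastebin above, and the second ring's subnet here is just an example):
>>
>>     totem {
>>         version: 2
>>         rrp_mode: passive
>>         interface {
>>             ringnumber: 0
>>             bindnetaddr: 10.10.10.7    # full IP, pins ring0 to this NIC
>>             mcastport: 5405
>>         }
>>         interface {
>>             ringnumber: 1
>>             bindnetaddr: 192.168.1.0   # example subnet for ring1
>>             mcastport: 5407
>>         }
>>     }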
>>
>> corosync.log from storage0:
>> http://pastebin.com/HK8KYDDQ
>>
>> corosync.log from storage1:
>> http://pastebin.com/sDWkcPUz
>>
>> corosync.log from storagequorum (the DC during this period):
>> http://pastebin.com/uENQ5fnf
>>
>> Issuing service corosync start && service pacemaker start on storage0
>> and storage1 resolved the problem and allowed the nodes to successfully
>> reconnect to the cluster. What other information can I provide to help
>> diagnose this problem and prevent it from recurring?
>>
>> Thanks,
>>
>> Andrew Martin