Andrew,

I was not able to find anything interesting (from the corosync point of view) in the corosync-related configuration or logs.

What would be helpful:
- If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID file for the dead corosync process. Can you please xz these files and store them somewhere? (They are quite large but compress well.)
- If you are able to reproduce the problem (which it seems you are), can you please enable generation of coredumps and store a backtrace of the coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID. To obtain the backtrace, run gdb corosync /var/lib/corosync/core.PID and then, at the gdb prompt, thread apply all bt.) If you are running a distribution with ABRT support, you can also use ABRT to generate a report.

Regards,
  Honza

Andrew Martin wrote:
> Corosync died an additional 3 times during the night on storage1. I wrote a
> daemon to attempt to start it as soon as it fails, so only one of those
> times resulted in a STONITH of storage1.
>
> I enabled debug in the corosync config, so I was able to capture a period
> when corosync died with debug output:
> http://pastebin.com/eAmJSmsQ
> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For
> reference, here is my Pacemaker configuration:
> http://pastebin.com/DFL3hNvz
>
> It seems that an extra node, 16777343 "localhost", has been added to the
> cluster after storage1 was STONITHed (must be the localhost interface on
> storage1). Is there any way to prevent this?
>
> Does this help to determine why corosync is dying, and what I can do to fix
> it?
>
> Thanks,
>
> Andrew
>
> ----- Original Message -----
>
> From: "Andrew Martin" <amar...@xes-inc.com>
> To: disc...@corosync.org
> Sent: Thursday, November 1, 2012 12:11:35 AM
> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>
> Hello,
>
> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0
> and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04
> amd64.
> Two of the nodes (storage0 and storage1) are "real" nodes where the
> resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while
> the third node (storagequorum) is in standby mode and acts as a quorum node
> for the cluster. Today I discovered that corosync died on both storage0 and
> storage1 at the same time. Since corosync died, pacemaker shut down as well
> on both nodes. Because the cluster no longer had quorum (and the
> no-quorum-policy="freeze"), storagequorum was unable to STONITH either node
> and just left the resources frozen where they were running, on storage0. I
> cannot find any log information to determine why corosync crashed, and this
> is a disturbing problem as the cluster and its messaging layer must be
> stable. Below is my corosync configuration file as well as the corosync log
> file from each node during this period.
>
> corosync.conf:
> http://pastebin.com/vWQDVmg8
> Note that I have two redundant rings. On one of them, I specify the IP
> address (in this example 10.10.10.7) so that it binds to the correct
> interface (since potentially in the future those machines may have two
> interfaces on the same subnet).
>
> corosync.log from storage0:
> http://pastebin.com/HK8KYDDQ
>
> corosync.log from storage1:
> http://pastebin.com/sDWkcPUz
>
> corosync.log from storagequorum (the DC during this period):
> http://pastebin.com/uENQ5fnf
>
> Issuing service corosync start && service pacemaker start on storage0 and
> storage1 resolved the problem and allowed the nodes to successfully reconnect
> to the cluster. What other information can I provide to help diagnose this
> problem and prevent it from recurring?
>
> Thanks,
>
> Andrew Martin
>
> _______________________________________________
> discuss mailing list
> disc...@corosync.org
> http://lists.corosync.org/mailman/listinfo/discuss

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
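
For reference, the two collection steps requested at the top of this message (compressing the fdata files and getting a backtrace out of a coredump) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the paths are the ones mentioned in the message, and it assumes gdb is installed and that coredumps have already been enabled on the node (the script does not enable them). It only prints the gdb command for each core file rather than running it.

```python
import lzma
import shlex
import shutil
from pathlib import Path

# Directory named in the message above; adjust if your build uses another path.
COROSYNC_DIR = Path("/var/lib/corosync")


def find_fdata(directory=COROSYNC_DIR):
    """Return the fdata-* files in directory, skipping already compressed ones."""
    return sorted(
        p for p in Path(directory).glob("fdata-*")
        if p.is_file() and not p.name.endswith(".xz")
    )


def compress_fdata(path):
    """Write an xz-compressed copy next to the original and return its path."""
    path = Path(path)
    target = path.with_name(path.name + ".xz")
    with open(path, "rb") as src, lzma.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def gdb_backtrace_cmd(core, binary="corosync"):
    """Build a non-interactive gdb command that prints all thread backtraces."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", str(binary), str(core)]


if __name__ == "__main__":
    for fdata in find_fdata():
        print("compressed:", compress_fdata(fdata))
    for core in sorted(COROSYNC_DIR.glob("core.*")):
        # Printed rather than executed, so the output can be reviewed first.
        print(shlex.join(gdb_backtrace_cmd(core)))
```

The original fdata files are left in place, so the compressed copies can be checked before anything is removed.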