Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

Jan Friesse Thu, 01 Nov 2012 06:00:35 -0700

Andrew,
I was not able to find anything interesting (from the corosync point of
view) in the corosync-related configuration and logs.


What would be helpful:
- If corosync died, there should be a
/var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you
please xz them and store them somewhere (they are quite large but compress
well)?
- If you are able to reproduce the problem (which it seems you are), can
you please enable generation of coredumps and store a backtrace of the
coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
the way to obtain the backtrace is gdb corosync /var/lib/corosync/core.PID,
and there run thread apply all bt.) If you are running a distribution with
ABRT support, you can also use ABRT to generate a report.
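
The two requests above can be scripted. What follows is only a minimal sketch in Python, not an official corosync tool: the output directory /tmp/corosync-debug and the helper names compress_fdata and gdb_backtrace_command are made up for illustration. It assumes core dumps are already enabled (typically by raising the core size limit, e.g. ulimit -c unlimited, before corosync starts) and that gdb is installed.

```python
import lzma
import shlex
import shutil
from pathlib import Path


def compress_fdata(state_dir, out_dir):
    """xz-compress every fdata-* file found in state_dir into out_dir.

    Returns the list of archives written. Files that already end in .xz
    are skipped so the function can be re-run safely.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives = []
    for src in sorted(Path(state_dir).glob("fdata-*")):
        if not src.is_file() or src.suffix == ".xz":
            continue
        dst = out / (src.name + ".xz")
        with src.open("rb") as fin, lzma.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, fdata files are large
        archives.append(dst)
    return archives


def gdb_backtrace_command(core_file, binary="corosync"):
    """Build a non-interactive gdb call that prints all thread backtraces."""
    return ["gdb", "--batch", "-ex", "thread apply all bt",
            binary, str(core_file)]


def main(state_dir="/var/lib/corosync", out_dir="/tmp/corosync-debug"):
    state = Path(state_dir)
    if not state.is_dir():
        print("no such directory:", state)
        return
    for archive in compress_fdata(state, out_dir):
        print("compressed:", archive)
    # Print the gdb commands rather than running them, so nothing is
    # executed against a core file without the admin looking first.
    for core in sorted(state.glob("core.*")):
        print("backtrace: ", shlex.join(gdb_backtrace_command(core)))


if __name__ == "__main__":
    main()
```

Redirecting the output of each printed gdb command to a file gives a backtrace that is easy to attach to a mail or a bug report.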

Regards,
  Honza

Andrew Martin wrote:
> Corosync died an additional 3 times during the night on storage1. I wrote a 
> daemon that attempts to restart it as soon as it fails, so only one of those 
> times resulted in a STONITH of storage1. 
> 
> I enabled debug in the corosync config, so I was able to capture a period 
> when corosync died with debug output: 
> http://pastebin.com/eAmJSmsQ 
> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
> reference, here is my Pacemaker configuration: 
> http://pastebin.com/DFL3hNvz 
> 
> It seems that an extra node, 16777343 "localhost", has been added to the 
> cluster after storage1 was STONITHed (must be the localhost interface on 
> storage1). Is there any way to prevent this? 
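
The number itself supports the localhost guess. As a small worked example, assuming the node ID was auto-generated from an IPv4 address because no explicit nodeid was set (an assumption; the configuration is only linked, not quoted here), 16777343 is 0x0100007F, which is the four bytes of 127.0.0.1 read as a little-endian 32-bit integer. The helper name nodeid_to_ipv4 is made up for illustration.

```python
import socket


def nodeid_to_ipv4(nodeid, byteorder="little"):
    """Interpret a 32-bit node ID as the raw bytes of an IPv4 address.

    byteorder="little" matches an address stored in network order and
    read back as an integer on an x86/amd64 host (assumption).
    """
    return socket.inet_ntoa(nodeid.to_bytes(4, byteorder))


print(hex(16777343))             # 0x100007f
print(nodeid_to_ipv4(16777343))  # 127.0.0.1
```

If that is what happened, the things to check would be whether the node's own hostname resolves to 127.0.0.1 and whether explicit nodeid values are set in the nodelist.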
> 
> Does this help to determine why corosync is dying, and what I can do to fix 
> it? 
> 
> Thanks, 
> 
> Andrew 
> 
> ----- Original Message -----
> 
> From: "Andrew Martin" <amar...@xes-inc.com> 
> To: disc...@corosync.org 
> Sent: Thursday, November 1, 2012 12:11:35 AM 
> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
> 
> 
> Hello, 
> 
> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 
> and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 
> amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the 
> resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while 
> the third node (storagequorum) is in standby mode and acts as a quorum node 
> for the cluster. Today I discovered that corosync died on both storage0 and 
> storage1 at the same time. Since corosync died, pacemaker shut down as well 
> on both nodes. Because the cluster no longer had quorum (and the 
> no-quorum-policy="freeze"), storagequorum was unable to STONITH either node 
> and just left the resources frozen where they were running, on storage0. I 
> cannot find any log information to determine why corosync crashed, and this 
> is a disturbing problem as the cluster and its messaging layer must be 
> stable. Below is my corosync configuration file as well as the corosync log 
> file from each node during this period. 
> 
> corosync.conf: 
> http://pastebin.com/vWQDVmg8 
> Note that I have two redundant rings. On one of them, I specify the IP 
> address (in this example 10.10.10.7) so that it binds to the correct 
> interface (since potentially in the future those machines may have two 
> interfaces on the same subnet). 
> 
> corosync.log from storage0: 
> http://pastebin.com/HK8KYDDQ 
> 
> corosync.log from storage1: 
> http://pastebin.com/sDWkcPUz 
> 
> corosync.log from storagequorum (the DC during this period): 
> http://pastebin.com/uENQ5fnf 
> 
> Issuing service corosync start && service pacemaker start on storage0 and 
> storage1 resolved the problem and allowed the nodes to successfully reconnect 
> to the cluster. What other information can I provide to help diagnose this 
> problem and prevent it from recurring? 
> 
> Thanks, 
> 
> Andrew Martin 
> 
> _______________________________________________ 
> discuss mailing list 
> disc...@corosync.org 
> http://lists.corosync.org/mailman/listinfo/discuss 
> 


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
