On 06/11/12 17:47 -0600, Andrew Martin wrote:
A bit more data on this problem: I was doing some maintenance and had to 
briefly disconnect storagequorum's connection to the STONITH network (ethernet 
cable #7 in this diagram):
http://sources.xes-inc.com/downloads/storagecluster.png


Since corosync has two rings (and is in active mode), this should cause no 
disruption to the cluster. However, as soon as I disconnected cable #7, 
corosync on storage0 died (corosync was already stopped on storage1), which 
caused pacemaker on storage0 to also shut down. I was not able to obtain a 
coredump this time as apport is still running on storage0.
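
(For reference, ring status can be checked on each node with:

# corosync-cfgtool -s

and with rrp_mode: active I would only expect that to mark the disconnected 
ring as FAULTY, not for corosync to die.)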


What else can I do to debug this problem? Or, should I just try to downgrade to 
corosync 1.4.2 (the version available in the Ubuntu repositories)?

Hi Andrew,

I set up 3 VMs running ubuntu-12.04.1-server-amd64.iso and, besides a build issue,
it seems to be running fine (no pacemaker yet) with your Corosync config.

This is with 2 interfaces set up with rrp_mode: active.

So no crash yet...

BTW: Fabio tells me this is not exactly a well-tested setup (active rrp mode), so this
might be part of your problem. Not sure if others can suggest a better-tested 
setup.

Regards
-Angus



Thanks,


Andrew

----- Original Message -----

From: "Andrew Martin" <amar...@xes-inc.com>
To: "Angus Salkeld" <asalk...@redhat.com>
Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org
Sent: Tuesday, November 6, 2012 2:01:17 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster


Hi Angus,


I recompiled corosync with the changes you suggested in exec/main.c to generate 
fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata 
files:
http://sources.xes-inc.com/downloads/core.13027
http://sources.xes-inc.com/downloads/fdata.20121106



(gdb) thread apply all bt


Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
#0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
#2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
#3 0x0000555555571700 in ?? ()
#4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
#5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
#6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
#7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
#8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
#9 0x0000555555560945 in main ()




I've also been doing some hardware tests to rule it out as the cause of this 
problem: mcelog has found no problems and memtest finds the memory to be 
healthy as well.


Thanks,


Andrew
----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Friday, November 2, 2012 8:18:51 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 02/11/12 13:07 -0500, Andrew Martin wrote:
Hi Angus,


Corosync died again while using libqb 0.14.3. Here is the coredump from today:
http://sources.xes-inc.com/downloads/corosync.nov2.coredump



# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide 
service.
info [MAIN ] Corosync built-in features: pie relro bindnow
Bus error (core dumped)


Here's the log: http://pastebin.com/bUfiB3T3


Did your analysis of the core dump reveal anything?


I can't get any symbols out of these coredumps. Can you try to get a backtrace?
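
Something like this on the node that produced the core should do it (paths are 
illustrative):

# gdb /usr/sbin/corosync /var/lib/corosync/core.<PID>
(gdb) thread apply all bt

If corosync and libqb are built with debug info (e.g. CFLAGS="-g -O2" when 
configuring), the ?? frames should resolve to real symbols.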


Is there a way for me to make it generate fdata with a bus error, or how else 
can I gather additional information to help debug this?


If you look in exec/main.c for SIGSEGV you will see how the mechanism
for fdata works. Just add a handler for SIGBUS and hook it up. Then you should
be able to get the fdata for both.
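
Roughly something like this, mirroring the existing SIGSEGV handler (a sketch 
only; the helper names may differ slightly in your checkout):

/* sketch: mirror the SIGSEGV path in exec/main.c */
static void sigbus_handler (int num)
{
        (void)signal (SIGBUS, SIG_DFL);       /* restore default so we don't re-enter */
        corosync_blackbox_write_to_file ();   /* same fdata dump the SIGSEGV handler does */
        qb_log_fini ();
        raise (SIGBUS);                       /* fall through to the default action (core dump) */
}

and, next to the existing SIGSEGV registration in main():

        (void)signal (SIGBUS, sigbus_handler);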

I'd rather be able to get a backtrace if possible.

-Angus


Thanks,


Andrew

----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Thursday, November 1, 2012 5:47:16 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 01/11/12 17:27 -0500, Andrew Martin wrote:
Hi Angus,


I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this 
behavior with it. I was able to get a coredump by running corosync manually in 
the foreground (corosync -f):
http://sources.xes-inc.com/downloads/corosync.coredump

Thanks, looking...



There still isn't anything added to /var/lib/corosync however. What do I need 
to do to enable the fdata file to be created?

Well if it crashes with SIGSEGV it will generate it automatically.
(I see you are getting a bus error) - :(.

-A



Thanks,

Andrew

----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Thursday, November 1, 2012 5:11:23 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 01/11/12 14:32 -0500, Andrew Martin wrote:
Hi Honza,


Thanks for the help. I enabled core dumps in /etc/security/limits.conf but 
didn't have a chance to reboot and apply the changes, so I don't have a core 
dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID 
file to be generated? Right now all that is in /var/lib/corosync are the 
ringid_XXX files. Do I need to set something explicitly in the corosync config 
to enable this logging?


I did find something else interesting with libqb this time. I compiled 
libqb 0.14.2 for use with the cluster. This time when corosync died I noticed 
the following in dmesg:
Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide 
error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
libqb.so.0.14.2[7f657a525000+1f000]
This error was only present for one of the many other times corosync has died.
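
(For what it's worth, that puts the fault at offset 0x9517 into libqb, i.e. 
0x7f657a52e517 - 0x7f657a525000. If a build with debug info is available, 
something like

addr2line -e /usr/lib/libqb.so.0.14.2 0x9517

should map that offset back to a source line.)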


I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix 
for this particular bug. Could this libqb problem be related to corosync 
hanging up? Here's the corresponding corosync log file (next time I should have a 
core dump as well):
http://pastebin.com/5FLKg7We

Hi Andrew

I can't see much wrong with the log either. If you could run with the latest
(libqb-0.14.3) and post a backtrace if it still happens, that would be great.

Thanks
Angus



Thanks,


Andrew

----- Original Message -----

From: "Jan Friesse" <jfrie...@redhat.com>
To: "Andrew Martin" <amar...@xes-inc.com>
Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" 
<pacemaker@oss.clusterlabs.org>
Sent: Thursday, November 1, 2012 7:55:52 AM
Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster

Andrew,
I was not able to find anything interesting (from the corosync point of
view) in the configuration/logs (corosync related).

What would be helpful:
- If corosync died, there should be a
/var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you please
xz them and store them somewhere (they are quite large but compress well).
- If you are able to reproduce the problem (which it seems you can), can
you please enable generation of coredumps and store a backtrace of the
coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
the way to get a backtrace from a coredump is gdb corosync
/var/lib/corosync/core.PID, and then thread apply all bt.) If you are running
a distribution with ABRT support, you can also use ABRT to generate a report.
(Example commands for both of these are sketched below.)
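
For example (file names follow the patterns above):

# xz /var/lib/corosync/fdata-*
# ulimit -c unlimited          (in the shell corosync is started from, before reproducing)
# gdb corosync /var/lib/corosync/core.<PID>
(gdb) thread apply all bt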

Regards,
Honza

Andrew Martin wrote:
Corosync died an additional 3 times during the night on storage1. I wrote a 
daemon to attempt to restart it as soon as it fails, so only one of those times 
resulted in a STONITH of storage1.

I enabled debug in the corosync config, so I was able to capture a period when 
corosync died with debug output:
http://pastebin.com/eAmJSmsQ
In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
reference, here is my Pacemaker configuration:
http://pastebin.com/DFL3hNvz

It seems that an extra node, 16777343 "localhost", has been added to the cluster 
after storage1 was STONITHed (it must be the localhost interface on storage1). Is there 
any way to prevent this?
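
(16777343 is 0x0100007F, i.e. the bytes of 127.0.0.1 read as a little-endian 
32-bit integer, which fits the guess that corosync bound to the loopback 
interface on storage1.)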

Does this help to determine why corosync is dying, and what I can do to fix it?

Thanks,

Andrew

----- Original Message -----

From: "Andrew Martin" <amar...@xes-inc.com>
To: disc...@corosync.org
Sent: Thursday, November 1, 2012 12:11:35 AM
Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster


Hello,

I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 
from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and 
storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and 
samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum 
node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the 
same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no 
longer had quorum (and no-quorum-policy="freeze" is set), storagequorum was unable to 
STONITH either node and just left the resources frozen where they were running, on storage0. I 
cannot find any log information to determine why corosync crashed, and this is a disturbing problem 
as the cluster and its messaging layer must be stable. Below is my corosync configuration file as 
well as the corosync log file from each node during this period.

corosync.conf:
http://pastebin.com/vWQDVmg8
Note that I have two redundant rings. On one of them, I specify the IP address 
(in this example 10.10.10.7) so that it binds to the correct interface (since 
potentially in the future those machines may have two interfaces on the same 
subnet).
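
In sketch form (values other than the 10.10.10.7 ring are illustrative; see the 
pastebin above for the real file), the totem section is along these lines:

totem {
        version: 2
        rrp_mode: active
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.7    # full host address so it binds to this specific interface
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.0.0   # second ring; real addresses/ports omitted here
                mcastport: 5407
        }
        # transport/mcastaddr and the remaining settings are omitted in this sketch
}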

corosync.log from storage0:
http://pastebin.com/HK8KYDDQ

corosync.log from storage1:
http://pastebin.com/sDWkcPUz

corosync.log from storagequorum (the DC during this period):
http://pastebin.com/uENQ5fnf

Issuing service corosync start && service pacemaker start on storage0 and 
storage1 resolved the problem and allowed the nodes to successfully reconnect to the 
cluster. What other information can I provide to help diagnose this problem and prevent 
it from recurring?

Thanks,

Andrew Martin

_______________________________________________
discuss mailing list
disc...@corosync.org
http://lists.corosync.org/mailman/listinfo/discuss

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org