Andrew,
good news. I believe I've found a reproducer for the problem you are
facing. Now, to be sure it's really the same one, can you please run:
df (the interesting entry is /dev/shm)
and send the output of ls -la /dev/shm?

I believe /dev/shm is full.

Now, as a quick workaround, just delete all qb-* files from /dev/shm and
the cluster should work. There are basically two problems:
- ipc_shm is leaking memory
- if there is no space left, libqb mmaps memory that tmpfs cannot back and
  receives SIGBUS (see the sketch below)

Angus is working on both issues.
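
For reference, the SIGBUS case can be reproduced outside of corosync
entirely. Below is a minimal standalone sketch of the failure mode (not
libqb's actual code; the qb-demo file name is made up): on tmpfs,
ftruncate() reserves no pages, so once /dev/shm is full the first write
into the mapping faults with SIGBUS.

    /* sketch: mmap a file on tmpfs and touch every page */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;             /* 1 MiB mapping */
        int fd = open("/dev/shm/qb-demo", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len) < 0) { /* allocates nothing on tmpfs */
            perror("open/ftruncate");
            return 1;
        }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(p, 0, len);  /* SIGBUS here if tmpfs cannot back the pages */
        puts("no SIGBUS: /dev/shm still has free space");

        munmap(p, len);
        close(fd);
        unlink("/dev/shm/qb-demo");
        return 0;
    }

Fill /dev/shm first (or shrink it with mount -o remount,size=1m /dev/shm)
and the memset dies with a bus error, which is exactly the signal you are
seeing.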

Regards,
  Honza

Jan Friesse wrote:
> Andrew,
> thanks for the valgrind report (even though it didn't show anything useful)
> and the blackbox.
> 
> We believe the problem is caused by access to invalid memory mapped by an
> mmap operation. There are basically three places where we do mmap:
> 1.) corosync cpg_zcb functions (I don't believe this is the case)
> 2.) LibQB IPC
> 3.) LibQB blackbox
> 
> Now, because neither Angus nor I are able to reproduce the bug, can you
> please:
> - apply the patches "Check successful initialization of IPC" and "Add
> support for selecting IPC type" (the later versions), or use corosync from
> git (either the needle or master branch, they are the same)
> - compile corosync
> - Add
> 
> qb {
>     ipc_type: socket
> }
> 
> to corosync.conf
> - Try running corosync
> 
> This may or may not solve the problem, but it should help us diagnose
> whether or not the problem is an IPC one.
> 
> Thanks,
>   Honza
> 
> Andrew Martin wrote:
>> Angus and Honza, 
>>
>>
>> I recompiled corosync with --enable-debug. Below is a capture of the 
>> valgrind output when corosync dies, after switching rrp_mode to passive: 
>>
>> # valgrind corosync -f 
>> ==5453== Memcheck, a memory error detector 
>> ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al. 
>> ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info 
>> ==5453== Command: corosync -f 
>> ==5453== 
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to 
>> provide service. 
>> info [MAIN ] Corosync built-in features: debug pie relro bindnow 
>> ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised 
>> byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3BFC8: totemudp_token_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38CF0: totemnet_token_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E40FB5: totemrrp_token_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C1A4: totemudp_token_target_set (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38EBC: totemnet_token_target_set (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feff7f58 is on thread 1's stack 
>> ==5453== 
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
>> uninitialised byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feffb9da is on thread 1's stack 
>> ==5453== 
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
>> uninitialised byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feffb9da is on thread 1's stack 
>> ==5453== 
>> Ringbuffer: 
>> ->OVERWRITE 
>> ->write_pt [0] 
>> ->read_pt [0] 
>> ->size [2097152 words] 
>> =>free [8388608 bytes] 
>> =>used [0 bytes] 
>> ==5453== 
>> ==5453== HEAP SUMMARY: 
>> ==5453== in use at exit: 13,175,149 bytes in 1,648 blocks 
>> ==5453== total heap usage: 70,091 allocs, 68,443 frees, 67,724,863 bytes 
>> allocated 
>> ==5453== 
>> ==5453== LEAK SUMMARY: 
>> ==5453== definitely lost: 0 bytes in 0 blocks 
>> ==5453== indirectly lost: 0 bytes in 0 blocks 
>> ==5453== possibly lost: 2,100,062 bytes in 35 blocks 
>> ==5453== still reachable: 11,075,087 bytes in 1,613 blocks 
>> ==5453== suppressed: 0 bytes in 0 blocks 
>> ==5453== Rerun with --leak-check=full to see details of leaked memory 
>> ==5453== 
>> ==5453== For counts of detected and suppressed errors, rerun with: -v 
>> ==5453== Use --track-origins=yes to see where uninitialised values come from 
>> ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2) 
>> Bus error (core dumped) 
>>
>>
>> I was also able to capture non-truncated fdata: 
>> http://sources.xes-inc.com/downloads/fdata-20121107 
>>
>>
>> Here is the coredump: 
>> http://sources.xes-inc.com/downloads/vgcore.5453 
>>
>>
>> I was not able to get corosync to crash without pacemaker also running, 
>> though I was only able to test for a short period of time. 
>>
>>
>> Another thing I discovered tonight was that the 127.0.1.1 entry in 
>> /etc/hosts (on both storage0 and storage1) was the source of the extra 
>> "localhost" entry in the cluster. I have commented out this line in 
>> /etc/hosts on all nodes in the cluster and removed the extraneous node, so 
>> now only the 3 real nodes remain. 
>> http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html
>>  
>>
>>
>> Thanks, 
>>
>>
>> Andrew 
>> ----- Original Message -----
>>
>> From: "Jan Friesse" <jfrie...@redhat.com> 
>> To: "Andrew Martin" <amar...@xes-inc.com> 
>> Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, 
>> pacemaker@oss.clusterlabs.org 
>> Sent: Wednesday, November 7, 2012 2:00:20 AM 
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>> cluster 
>>
>> Andrew, 
>>
>> Andrew Martin wrote: 
>>> A bit more data on this problem: I was doing some maintenance and had to 
>>> briefly disconnect storagequorum's connection to the STONITH network 
>>> (ethernet cable #7 in this diagram): 
>>> http://sources.xes-inc.com/downloads/storagecluster.png 
>>>
>>>
>>> Since corosync has two rings (and is in active mode), this should cause no 
>>> disruption to the cluster. However, as soon as I disconnected cable #7, 
>>> corosync on storage0 died (corosync was already stopped on storage1), which 
>>> caused pacemaker on storage0 to also shut down. I was not able to obtain a 
>>> coredump this time as apport is still running on storage0. 
>>
>> I strongly believe the corosync fault is due to the original problem you 
>> have. I would also recommend that you try passive mode. Passive mode is 
>> better because if one link fails, passive mode still makes progress 
>> (delivers messages), whereas active mode doesn't (up to the moment the ring 
>> is marked as failed; after that, passive and active behave the same). 
>> Passive mode is also much better tested. 
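>>
>> For reference, switching is a one-line change in the totem section of
>> corosync.conf (just a sketch; keep the rest of your existing totem
>> settings as they are and restart corosync on each node afterwards):
>>
>>     totem {
>>         ...
>>         rrp_mode: passive
>>     }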
>>
>>>
>>>
>>> What else can I do to debug this problem? Or, should I just try to 
>>> downgrade to corosync 1.4.2 (the version available in the Ubuntu 
>>> repositories)? 
>>
>> I would really like to find the main issue (which looks like a libqb one 
>> rather than a corosync one). But if you decide to downgrade, please 
>> downgrade to the latest 1.4.x release (1.4.4 for now); 1.4.2 has A LOT of 
>> known bugs. 
>>
>>>
>>>
>>> Thanks, 
>>>
>>>
>>> Andrew 
>>
>> Regards, 
>> Honza 
>>
>>>
>>> ----- Original Message ----- 
>>>
>>> From: "Andrew Martin" <amar...@xes-inc.com> 
>>> To: "Angus Salkeld" <asalk...@redhat.com> 
>>> Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org 
>>> Sent: Tuesday, November 6, 2012 2:01:17 PM 
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>> cluster 
>>>
>>>
>>> Hi Angus, 
>>>
>>>
>>> I recompiled corosync with the changes you suggested in exec/main.c to 
>>> generate fdata when SIGBUS is triggered. Here are the corresponding 
>>> coredump and fdata files: 
>>> http://sources.xes-inc.com/downloads/core.13027 
>>> http://sources.xes-inc.com/downloads/fdata.20121106 
>>>
>>>
>>>
>>> (gdb) thread apply all bt 
>>>
>>>
>>> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)): 
>>> #0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0 
>>> #1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0 
>>> #2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0 
>>> #3 0x0000555555571700 in ?? () 
>>> #4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5 
>>> #5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5 
>>> #6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5 
>>> #7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0 
>>> #8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0 
>>> #9 0x0000555555560945 in main () 
>>>
>>>
>>>
>>>
>>> I've also been doing some hardware tests to rule it out as the cause of 
>>> this problem: mcelog has found no problems and memtest finds the memory to 
>>> be healthy as well. 
>>>
>>>
>>> Thanks, 
>>>
>>>
>>> Andrew 
>>> ----- Original Message ----- 
>>>
>>> From: "Angus Salkeld" <asalk...@redhat.com> 
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org 
>>> Sent: Friday, November 2, 2012 8:18:51 PM 
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>> cluster 
>>>
>>> On 02/11/12 13:07 -0500, Andrew Martin wrote: 
>>>> Hi Angus, 
>>>>
>>>>
>>>> Corosync died again while using libqb 0.14.3. Here is the coredump from 
>>>> today: 
>>>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump 
>>>>
>>>>
>>>>
>>>> # corosync -f 
>>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to 
>>>> provide service. 
>>>> info [MAIN ] Corosync built-in features: pie relro bindnow 
>>>> Bus error (core dumped) 
>>>>
>>>>
>>>> Here's the log: http://pastebin.com/bUfiB3T3 
>>>>
>>>>
>>>> Did your analysis of the core dump reveal anything? 
>>>>
>>>
>>> I can't get any symbols out of these coredumps. Can you try to get a 
>>> backtrace? 
>>>
>>>>
>>>> Is there a way for me to make it generate fdata on a bus error, or how 
>>>> else can I gather additional information to help debug this? 
>>>>
>>>
>>> If you look in exec/main.c and look for SIGSEGV you will see how the 
>>> mechanism for fdata works. Just add a handler for SIGBUS and hook it up 
>>> the same way. Then you should be able to get the fdata for both. 
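>>>
>>> A minimal sketch of what that hookup could look like (hypothetical names;
>>> in a real patch you would reuse the body of the existing SIGSEGV handler
>>> in exec/main.c rather than the placeholder comment below):
>>>
>>>     #include <signal.h>
>>>
>>>     static void sigbus_handler(int num)
>>>     {
>>>         /* do whatever the existing SIGSEGV handler does here,
>>>          * i.e. flush the libqb blackbox so fdata gets written */
>>>
>>>         /* then restore the default action and re-raise so a
>>>          * core dump is still produced */
>>>         signal(num, SIG_DFL);
>>>         raise(num);
>>>     }
>>>
>>>     /* during startup, next to the existing SIGSEGV setup: */
>>>     void install_sigbus_handler(void)
>>>     {
>>>         signal(SIGBUS, sigbus_handler);
>>>     }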
>>>
>>> I'd rather be able to get a backtrace if possible. 
>>>
>>> -Angus 
>>>
>>>>
>>>> Thanks, 
>>>>
>>>>
>>>> Andrew 
>>>>
>>>> ----- Original Message ----- 
>>>>
>>>> From: "Angus Salkeld" <asalk...@redhat.com> 
>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org 
>>>> Sent: Thursday, November 1, 2012 5:47:16 PM 
>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>>> cluster 
>>>>
>>>> On 01/11/12 17:27 -0500, Andrew Martin wrote: 
>>>>> Hi Angus, 
>>>>>
>>>>>
>>>>> I'll try upgrading to the latest libqb tomorrow and see if I can 
>>>>> reproduce this behavior with it. I was able to get a coredump by running 
>>>>> corosync manually in the foreground (corosync -f): 
>>>>> http://sources.xes-inc.com/downloads/corosync.coredump 
>>>>
>>>> Thanks, looking... 
>>>>
>>>>>
>>>>>
>>>>> There still isn't anything added to /var/lib/corosync however. What do I 
>>>>> need to do to enable the fdata file to be created? 
>>>>
>>>> Well, if it crashes with SIGSEGV it will generate it automatically 
>>>> (I see you are getting a bus error instead) :(. 
>>>>
>>>> -A 
>>>>
>>>>>
>>>>>
>>>>> Thanks, 
>>>>>
>>>>> Andrew 
>>>>>
>>>>> ----- Original Message ----- 
>>>>>
>>>>> From: "Angus Salkeld" <asalk...@redhat.com> 
>>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org 
>>>>> Sent: Thursday, November 1, 2012 5:11:23 PM 
>>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>>>> cluster 
>>>>>
>>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote: 
>>>>>> Hi Honza, 
>>>>>>
>>>>>>
>>>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf 
>>>>>> but didn't have a chance to reboot and apply the changes, so I don't have 
>>>>>> a core dump this time. Do core dumps need to be enabled for the 
>>>>>> fdata-DATETIME-PID file to be generated? Right now all that is in 
>>>>>> /var/lib/corosync is the ringid_XXX files. Do I need to set something 
>>>>>> explicitly in the corosync config to enable this logging? 
>>>>>>
>>>>>>
>>>>>> I did find something else interesting with libqb this time. I compiled 
>>>>>> libqb 0.14.2 for use with the cluster. This time when corosync died I 
>>>>>> noticed the following in dmesg: 
>>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap 
>>>>>> divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
>>>>>> libqb.so.0.14.2[7f657a525000+1f000] 
>>>>>> This error was present for only one of the many times corosync has 
>>>>>> died. 
>>>>>>
>>>>>>
>>>>>> I see that there is a newer version of libqb (0.14.3) out, but I didn't 
>>>>>> see a fix for this particular bug. Could this libqb problem be causing 
>>>>>> corosync to hang up? Here's the corresponding corosync log file 
>>>>>> (next time I should have a core dump as well): 
>>>>>> http://pastebin.com/5FLKg7We 
>>>>>
>>>>> Hi Andrew 
>>>>>
>>>>> I can't see much wrong with the log either. If you could run with the 
>>>>> latest 
>>>>> (libqb-0.14.3) and post a backtrace if it still happens, that would be 
>>>>> great. 
>>>>>
>>>>> Thanks 
>>>>> Angus 
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks, 
>>>>>>
>>>>>>
>>>>>> Andrew 
>>>>>>
>>>>>> ----- Original Message ----- 
>>>>>>
>>>>>> From: "Jan Friesse" <jfrie...@redhat.com> 
>>>>>> To: "Andrew Martin" <amar...@xes-inc.com> 
>>>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" 
>>>>>> <pacemaker@oss.clusterlabs.org> 
>>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM 
>>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
>>>>>>
>>>>>> Andrew, 
>>>>>> I was not able to find anything interesting (from the corosync point of 
>>>>>> view) in the configuration or logs. 
>>>>>>
>>>>>> What would be helpful: 
>>>>>> - if corosync died, there should be a 
>>>>>> /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you 
>>>>>> please xz these and store them somewhere (they are quite large but 
>>>>>> compress well)? 
>>>>>> - if you are able to reproduce the problem (which it seems you are), can 
>>>>>> you please enable generation of coredumps and store a backtrace of the 
>>>>>> coredump somewhere? (Coredumps are stored in /var/lib/corosync as 
>>>>>> core.PID; the way to obtain a backtrace is gdb corosync 
>>>>>> /var/lib/corosync/core.PID and then thread apply all bt.) If you are 
>>>>>> running a distribution with ABRT support, you can also use ABRT to 
>>>>>> generate a report. 
>>>>>>
>>>>>> Regards, 
>>>>>> Honza 
>>>>>>
>>>>>> Andrew Martin wrote: 
>>>>>>> Corosync died an additional 3 times during the night on storage1. I 
>>>>>>> wrote a daemon to attempt to restart it as soon as it fails, so only one 
>>>>>>> of those times resulted in a STONITH of storage1. 
>>>>>>>
>>>>>>> I enabled debug in the corosync config, so I was able to capture a 
>>>>>>> period when corosync died with debug output: 
>>>>>>> http://pastebin.com/eAmJSmsQ 
>>>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. 
>>>>>>> For reference, here is my Pacemaker configuration: 
>>>>>>> http://pastebin.com/DFL3hNvz 
>>>>>>>
>>>>>>> It seems that an extra node, 16777343 "localhost", has been added to the 
>>>>>>> cluster after storage1 was STONITHed (it must be the localhost interface 
>>>>>>> on storage1). Is there any way to prevent this? 
>>>>>>>
>>>>>>> Does this help to determine why corosync is dying, and what I can do to 
>>>>>>> fix it? 
>>>>>>>
>>>>>>> Thanks, 
>>>>>>>
>>>>>>> Andrew 
>>>>>>>
>>>>>>> ----- Original Message ----- 
>>>>>>>
>>>>>>> From: "Andrew Martin" <amar...@xes-inc.com> 
>>>>>>> To: disc...@corosync.org 
>>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM 
>>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
>>>>>>>
>>>>>>>
>>>>>>> Hello, 
>>>>>>>
>>>>>>> I recently configured a 3-node fileserver cluster by building Corosync 
>>>>>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running 
>>>>>>> Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" 
>>>>>>> nodes where the resources run (a DRBD disk, filesystem mount, and 
>>>>>>> samba/nfs daemons), while the third node (storagequorum) is in standby 
>>>>>>> mode and acts as a quorum node for the cluster. Today I discovered that 
>>>>>>> corosync died on both storage0 and storage1 at the same time. Since 
>>>>>>> corosync died, pacemaker shut down as well on both nodes. Because the 
>>>>>>> cluster no longer had quorum (and the no-quorum-policy="freeze"), 
>>>>>>> storagequorum was unable to STONITH either node and just left the 
>>>>>>> resources frozen where they were running, on storage0. I cannot find 
>>>>>>> any log information to determine why corosync crashed, and this is a 
>>>>>>> disturbing problem as the cluster and its messaging layer must be 
>>>>>>> stable. Below is my corosync configuration file as well as the corosync 
>>>>>>> log file from each node during this period. 
>>>>>>>
>>>>>>> corosync.conf: 
>>>>>>> http://pastebin.com/vWQDVmg8 
>>>>>>> Note that I have two redundant rings. On one of them, I specify the IP 
>>>>>>> address (in this example 10.10.10.7) so that it binds to the correct 
>>>>>>> interface (since potentially in the future those machines may have two 
>>>>>>> interfaces on the same subnet). 
>>>>>>>
>>>>>>> corosync.log from storage0: 
>>>>>>> http://pastebin.com/HK8KYDDQ 
>>>>>>>
>>>>>>> corosync.log from storage1: 
>>>>>>> http://pastebin.com/sDWkcPUz 
>>>>>>>
>>>>>>> corosync.log from storagequorum (the DC during this period): 
>>>>>>> http://pastebin.com/uENQ5fnf 
>>>>>>>
>>>>>>> Issuing service corosync start && service pacemaker start on storage0 
>>>>>>> and storage1 resolved the problem and allowed the nodes to successfully 
>>>>>>> reconnect to the cluster. What other information can I provide to help 
>>>>>>> diagnose this problem and prevent it from recurring? 
>>>>>>>
>>>>>>> Thanks, 
>>>>>>>
>>>>>>> Andrew Martin 
>>>>>>>