Re: [Openais] Corosync (version 1.23 on rhel6) crashes when packets are dropped

Stanley, Ephrim Tue, 02 Aug 2011 18:27:08 -0700

Steve,

The version of Corosync that's installed on my box is


        uname -a
        Linux 833873v1.etc.test.gs.com 2.6.32-71.18.2.el6.x86_64 #1 SMP Wed Mar 2 14:17:40 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

        [root@833873v1 mrg-evaluation]# rpm -qa | grep -i corosync
        corosync-1.2.3-21.el6_0.1.x86_64
        corosynclib-1.2.3-21.el6_0.1.i686
        corosynclib-1.2.3-21.el6_0.1.x86_64

To enable the #defines, I built the source from 
ftp://ftp:[email protected]/downloads/corosync-1.2.3/corosync-1.2.3.tar.gz
  

1.2.3 seems to support RRP (redundant ring). This is what I see in the
Corosync logs when I start the service:

Aug 02 21:18:36 corosync [MAIN  ] main.c:1351 Corosync Cluster Engine ('1.2.3'): started and ready to provide service.
Aug 02 21:18:36 corosync [MAIN  ] main.c:1352 Corosync built-in features: nss rdma
Aug 02 21:18:36 corosync [MAIN  ] main.c:1427 Successfully read main configuration file '/etc/corosync/corosync.conf'.
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:822 Token Timeout (1000 ms) retransmit timeout (238 ms)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:825 token hold (180 ms) retransmits before loss (4 retrans)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:832 join (50 ms) send_join (0 ms) consensus (1200 ms) merge (200 ms)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:835 downcheck (1000 ms) fail to recv const (50 msgs)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:837 seqno unchanged const (30 rotations) Maximum network MTU 1402
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:841 window size per rotation (50 messages) maximum messages per rotation (17 messages)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:845 missed count const (5 messages)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:848 send threads (0 threads)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:851 RRP token expired timeout (238 ms)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:854 RRP token problem counter (2000 ms)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:857 RRP threshold (10 problem count)
Aug 02 21:18:36 corosync [TOTEM ] totemsrp.c:859 RRP mode set to active.

I did set ulimit -c unlimited before running Corosync, and there is still no
core file.

[root@833873v1 sbin]# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30562
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
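
A side note on the core file: the limits listed here belong to this shell.
A core is only written if the process that crashes has a non-zero
RLIMIT_CORE itself and is still marked dumpable, and a daemon started by a
different parent does not inherit this shell's settings. A minimal sketch in
C of checking that from inside a process on Linux (core_dump_allowed and
raise_core_limit are hypothetical helper names, not taken from the Corosync
source):

```c
#include <sys/prctl.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if the calling process is allowed to
 * write a core file, 0 if it is not, and -1 on error. */
int core_dump_allowed(void)
{
    struct rlimit rl;
    int dumpable;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == 0)
        return 0;               /* soft core limit of 0: no core file */

    /* The kernel clears the dumpable flag in some cases, for example
     * after a change of effective user ID. */
    dumpable = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
    if (dumpable < 0)
        return -1;
    return dumpable != 0;
}

/* Hypothetical helper: raises the soft core limit to the hard limit.
 * Returns 0 on success, -1 on error. */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

For a daemon that is already running, the same information is visible from
outside in /proc/<pid>/limits (the "Max core file size" row).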

Btw, I'm able to get the same behavior by dropping packets using iptables.
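
For anyone following along: the TEST_DROP_* percentages quoted further down
describe a per-packet drop probability. A minimal sketch of that kind of
check, assuming a plain rand() % 100 comparison (should_drop and count_drops
are hypothetical names; this is an illustration, not a copy of the code in
totemsrp.c):

```c
#include <stdlib.h>

/* Same name and value as the define quoted in this thread. */
#define TEST_DROP_MCAST_PERCENTAGE 50

/* Hypothetical helper: returns 1 if this packet should be dropped,
 * with a probability of roughly `percentage` percent. */
int should_drop(int percentage)
{
    return (rand() % 100) < percentage;
}

/* Hypothetical helper: simulates n packets with a fixed seed and
 * returns how many of them would be dropped. */
int count_drops(int n, int percentage, unsigned int seed)
{
    int dropped = 0;
    int i;

    srand(seed);
    for (i = 0; i < n; i++)
        dropped += should_drop(percentage);
    return dropped;
}
```

Calling count_drops(10000, TEST_DROP_MCAST_PERCENTAGE, 42) should report
roughly half of the packets as dropped, which is the loss rate NODE1 sees on
multicast messages with the unchanged defaults.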

Thanks, Ephrim.

-----Original Message-----
From: Steven Dake [mailto:[email protected]] 
Sent: Tuesday, August 02, 2011 9:09 PM
To: Stanley, Ephrim [Tech]
Cc: '[email protected]'
Subject: Re: [Openais] Corosync (version 1.23 on rhel6) crashes when packets are dropped

On 08/02/2011 04:47 PM, Stanley, Ephrim wrote:
> Hi,
>  
> I'm evaluating the Qpid messaging broker which uses Corosync for
> clustering. As part of my cluster break tests, I ran into a problem
> where Corosync dies without producing any core files or error messages.
>  
> Is this expected? Also, what are some best practices for testing packet
> loss with Corosync?
>  
> Steps to reproduce:
> 
>  1. Compile Corosync 1.2.3 after enabling the #defines for packet loss
>     (in totemsrp.c, line 129). I did not change the drop percentages;
>     I left them as is.
> 
>  
>   #define TEST_DROP_ORF_TOKEN_PERCENTAGE 30
>   #define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30
>   #define TEST_DROP_MCAST_PERCENTAGE 50
>   #define TEST_RECOVERY_MSG_COUNT 300
>  
> 
>  2. Start a qpid cluster with three nodes NODE1, NODE2, NODE3
>  3. Nodes NODE2 and NODE3 are run with the Corosync that does not drop
>     packets
>  4. Start the qpid process on nodes NODE2 and NODE3
>  5. After both processes are up, corosync-cpgtool reports the cluster
>     membership correctly
>  6. On NODE1, start Corosync (that drops packets)
>  7. Corosync starts and packet drops can be observed in the Corosync log
>     (I added some debug log statements)
>  8. Start a qpid process on NODE1
>  9. Now, Corosync crashes on NODE1. No core files are produced. 
> 
>  
> I have attached the output of corosync-fplay on NODE1 and a diff of the
> changes I made to totemsrp.c.
>  
>  
> Thanks, Ephrim.
>  
>  
>  

Ephrim

Could you be more specific about which version of Red Hat's build of
corosync you are using?  Redundant ring is not supported in 1.2.3 by
either upstream or Red Hat.

Looking at existing bugs that have not hit z-streams yet, this may be the issue:
https://bugzilla.redhat.com/show_bug.cgi?id=722522

To get a core file, set ulimit -c unlimited before running corosync.  A
core file would verify whether this is a known, fixed problem or a new issue.

Thanks
-steve


> 
> 
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais

