Your firewall may be blocking the ports corosync uses to communicate.
Newer versions of corosync include a diagnostic that tells the user
this may be the problem.

The firewall needs to be configured properly if it is enabled (which it
is by default in RHEL/Fedora).  In a RHEL environment, this can be done
via the System->Preferences->Firewall GUI or by adding your own
iptables rules.
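
For example, to open the totem ports with iptables (a rough sketch
based on the corosync.conf quoted below; corosync uses mcastport for
receives and mcastport - 1 for sends, so open both, on every node):

  # allow totem traffic on mcastport (5405) and mcastport - 1 (5404)
  iptables -A INPUT -p udp -m multiport --dports 5404,5405 -j ACCEPT
  # admit the multicast group itself in case INPUT policy is restrictive
  iptables -A INPUT -d 226.94.1.1/32 -j ACCEPT
  service iptables save

Adjust the addresses/ports to match your own config, and repeat for
each ring if the ports differ.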

Regards
-steve

On 02/21/2011 12:56 AM, NAKAHIRA Kazutomo wrote:
> Hi, all
> 
> # This problem is related to the following previous subject, and we use
> the same test environment.
> https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html
> 
> The corosync start-up process fell into an infinite loop
> in my test environment.
> 
> The corosync process wrote a large number of the following log messages
> to the debug logfile, and the start-up process stalled.
> 
> -- ha-debug --
> Feb 21 15:39:46 node1 corosync[19268]:   [TOTEM ] totemsrp.c:1852
> entering GATHER state from 11.
> -- ha-debug --
> 
> It seems that all nodes are sending a lot of "join messages" and
> have no way out of the GATHER state.
> 
> Is this loop expected operation?
> 
> The backtrace of the corosync process is:
> (gdb) bt
> #0  0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
> #1  0x00000031bdcda904 in usleep () from /lib64/libc.so.6
> #2  0x000000351ae11245 in memb_join_message_send (instance=0x7f81483aa010)
>     at totemsrp.c:2959
> #3  0x000000351ae13aeb in memb_state_gather_enter (instance=0x7f81483aa010,
>     gather_from=11) at totemsrp.c:1815
> #4  0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010,
>     memb_join=0x232e6c8) at totemsrp.c:3997
> #5  0x000000351ae175a9 in message_handler_memb_join
> (instance=0x7f81483aa010,
>     msg=<value optimized out>, msg_len=<value optimized out>,
>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
> #6  0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0, msg=0x232e6c8,
>     msg_len=596) at totemrrp.c:1511
> #7  0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>,
>     fd=<value optimized out>, revents=<value optimized out>, data=0x232e020)
>     at totemudp.c:1244
> #8  0x000000351ae07202 in poll_run (handle=1265737887312248832)
>     at coropoll.c:510
> #9  0x0000000000406cfd in main (argc=<value optimized out>,
>     argv=<value optimized out>, envp=<value optimized out>) at main.c:1813
> 
> 
> Our test environment is:
>  RHEL 6 (kernel 2.6.32-71.14.1.el6.x86_64)
>  Corosync-1.3.0-1
>  Pacemaker-1.0.10-1
>  cluster-glue-1.0.6-1
>  resource-agents-1.0.3-1
> 
> 
> Our corosync.conf is:
> -- corosync.conf --
> compatibility: whitetank
> 
> aisexec {
>     user: root
>     group: root
> }
> 
> service {
>     name: pacemaker
>     ver: 0
> }
> 
> totem {
>     version: 2
>     secauth: off
>     rrp_mode: active
>     token: 16000
>     consensus: 20000
>     clear_node_high_bit: yes
>     rrp_problem_count_timeout: 30000
>     fail_recv_const: 50
>     send_join: 10
>     interface {
>         ringnumber: 0
>         bindnetaddr: AAA.BBB.xxx.0
>         mcastaddr: 226.94.1.1
>         mcastport: 5405
>     }
>     interface {
>         ringnumber: 1
>         bindnetaddr: AAA.BBB.yyy.0
>         mcastaddr: 226.94.1.1
>         mcastport: 5405
>     }
> }
> 
> logging {
>     fileline: on
>     to_syslog: yes
>     syslog_facility: local1
>     syslog_priority: info
>     debug: on
>     timestamp: on
> }
> -- corosync.conf --
> 
> We tried "fail_recv_const: 5000" and it lighten incidence of problem,
> But corosync start-up problem keeps being generated now.
> 
> If "send_join: 10" is not set, a lot of multicast packet causes crowding
> the network and other network communications are blocked.
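> 
> For reference, a quick way to observe the join traffic while
> reproducing the problem (hypothetical interface name eth0; the
> multicast group and port are from our corosync.conf above):
> 
>   tcpdump -ni eth0 udp and host 226.94.1.1 and port 5405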
> 
> 
> Best Regards,
> 
