On 02/22/2011 04:47 AM, NAKAHIRA Kazutomo wrote:
> Hi, Steven
> 
> Thank you for your speedy response.
> 
> I use iptables, but it has no DROP/REJECT rules
> in the INPUT or OUTPUT chains.
> 
> My iptables settings are below:
> 
> [root@test1 ~]# iptables -L
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
> ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps
> 
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> ACCEPT     all  --  anywhere             192.168.xxx.0/24    state RELATED,ESTABLISHED
> ACCEPT     all  --  192.168.xxx.0/24     anywhere
> ACCEPT     all  --  anywhere             anywhere
> REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
> REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
> ACCEPT     all  --  anywhere             anywhere            PHYSDEV match --physdev-is-bridged
> 
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> Are there any problems here?
> 
> BTW, I use Corosync 1.3 on RHEL6 with a 12-node cluster.
> Does anyone have a good track record with Corosync 1.3 + RHEL6 on a
> large-scale cluster?
> 

Are you running on bare metal equipment?  Or in a VM?  If running in a
virtual machine, iptables should also be configured properly on the VM host.
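For example, a rule like the PHYSDEV one already visible in your FORWARD
chain makes sure bridged guest frames are not filtered on the host (a
sketch only; adjust to your bridge setup):

  iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT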

There are significant deployments of corosync-1.2.3, which ships in
RHEL6 and is similar to 1.3.0.

I am a little stumped.  A few things you can try:

turn off selinux
turn off iptables (service iptables stop)
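
Something like this (note that setenforce only disables SELinux
enforcement until the next reboot):

  setenforce 0           # put SELinux in permissive mode
  service iptables stop  # stop the firewall and flush its rules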

See if either of those solves the problem.

Since it sounds like you're building corosync on your own, try the
pre-built binaries and see if those work.
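
On RHEL6 that would be something like this (package names assumed to be
the stock ones):

  yum install corosync corosynclib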

Regards
-steve

> Best Regards,
> 
> (2011/02/22 3:08), Steven Dake wrote:
>> Your firewall may be enabled for the ports corosync uses to communicate.
>> Newer versions of corosync have a diagnostic that tells the user this
>> may be a problem for them.
>>
>> The firewall needs to be configured properly if it is enabled (which it
>> is by default in RHEL/Fedora).  In a RHEL environment, this can be done
>> via the system->preferences->firewall GUI or by adding your own iptables
>> rules.
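>>
>> For example (a sketch, assuming the default mcastport 5405 from your
>> config; totem also uses mcastport - 1 for sends, so open both):
>>
>>   iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT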
>>
>> Regards
>> -steve
>>
>> On 02/21/2011 12:56 AM, NAKAHIRA Kazutomo wrote:
>>> Hi, all
>>>
>>> # This problem is related to the following previous thread, and we use
>>> # the same test environment:
>>> https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html
>>>
>>>
>>> The corosync start-up process fell into an infinite loop
>>> in my test environment.
>>>
>>> The corosync process output the following log repeatedly to the debug
>>> logfile, and the start-up process stalled.
>>>
>>> -- ha-debug --
>>> Feb 21 15:39:46 node1 corosync[19268]:   [TOTEM ] totemsrp.c:1852 entering GATHER state from 11.
>>> -- ha-debug --
>>>
>>> It seems that all nodes are sending a lot of "join" messages and
>>> have no way out of the GATHER state.
>>>
>>> Is this loop expected behavior?
>>>
>>> The backtrace of the corosync process is:
>>> (gdb) bt
>>> #0  0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
>>> #1  0x00000031bdcda904 in usleep () from /lib64/libc.so.6
>>> #2  0x000000351ae11245 in memb_join_message_send (instance=0x7f81483aa010) at totemsrp.c:2959
>>> #3  0x000000351ae13aeb in memb_state_gather_enter (instance=0x7f81483aa010, gather_from=11) at totemsrp.c:1815
>>> #4  0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010, memb_join=0x232e6c8) at totemsrp.c:3997
>>> #5  0x000000351ae175a9 in message_handler_memb_join (instance=0x7f81483aa010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
>>> #6  0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0, msg=0x232e6c8, msg_len=596) at totemrrp.c:1511
>>> #7  0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x232e020) at totemudp.c:1244
>>> #8  0x000000351ae07202 in poll_run (handle=1265737887312248832) at coropoll.c:510
>>> #9  0x0000000000406cfd in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1813
>>>
>>>
>>> Our test environment is:
>>>   RHEL6(kernel 2.6.32-71.14.1.el6.x86_64)
>>>   Corosync-1.3.0-1
>>>   Pacemaker-1.0.10-1
>>>   cluster-glue-1.0.6-1
>>>   resource-agents-1.0.3-1
>>>
>>>
>>> Our corosync.conf is:
>>> -- corosync.conf --
>>> compatibility: whitetank
>>>
>>> aisexec {
>>>      user: root
>>>      group: root
>>> }
>>>
>>> service {
>>>      name: pacemaker
>>>      ver: 0
>>> }
>>>
>>> totem {
>>>      version: 2
>>>      secauth: off
>>>      rrp_mode: active
>>>      token: 16000
>>>      consensus: 20000
>>>      clear_node_high_bit: yes
>>>      rrp_problem_count_timeout: 30000
>>>      fail_recv_const: 50
>>>      send_join: 10
>>>      interface {
>>>          ringnumber: 0
>>>          bindnetaddr: AAA.BBB.xxx.0
>>>          mcastaddr: 226.94.1.1
>>>          mcastport: 5405
>>>      }
>>>      interface {
>>>          ringnumber: 1
>>>          bindnetaddr: AAA.BBB.yyy.0
>>>          mcastaddr: 226.94.1.1
>>>          mcastport: 5405
>>>      }
>>> }
>>>
>>> logging {
>>>      fileline: on
>>>      to_syslog: yes
>>>      syslog_facility: local1
>>>      syslog_priority: info
>>>      debug: on
>>>      timestamp: on
>>> }
>>> -- corosync.conf --
>>>
>>> We tried "fail_recv_const: 5000", and it lessened the incidence of the
>>> problem, but the corosync start-up problem still occurs.
>>>
>>> If "send_join: 10" is not set, the flood of multicast packets crowds
>>> the network and blocks other network communication.
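>>>
>>> For reference, the tuned totem values (a fragment of the config above;
>>> fail_recv_const was raised from its earlier value of 50):
>>>
>>>     fail_recv_const: 5000
>>>     send_join: 10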
>>>
>>>
>>> Best Regards,
>>>
>>
> 

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
