Re: [Openais] The start process of corosync fell into an infinite loop

NAKAHIRA Kazutomo Tue, 22 Feb 2011 03:50:44 -0800

Hi, Steven

Thank you for your speedy response.


I use iptables, but it has no DROP/REJECT rules
for the INPUT and OUTPUT chains.

My iptables settings are below:

[root@test1 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.xxx.0/24    state RELATED,ESTABLISHED
ACCEPT     all  --  192.168.xxx.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
ACCEPT     all  --  anywhere             anywhere            PHYSDEV match --physdev-is-bridged

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Are there any problems with these settings?
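
As a cross-check of the listing above, here is a minimal sketch that scans
`iptables -L` text and reports which chains contain DROP/REJECT entries.
Python, the function name `blocking_rules`, and the embedded `LISTING` sample
(copied from the output above, with wrapped lines joined) are assumptions made
for illustration; they are not part of corosync or iptables. It only inspects
the filter table as printed, so it cannot prove that multicast traffic is
actually delivered between nodes.

```python
import re

BLOCKING = ("DROP", "REJECT")


def blocking_rules(listing):
    """Return {chain: [blocking entries]} for text produced by `iptables -L`."""
    found = {}
    chain = None
    for line in listing.splitlines():
        header = re.match(r"Chain (\S+) \((.*)\)", line)
        if header:
            chain = header.group(1)
            found[chain] = []
            # Built-in chains print "policy X"; user-defined chains print a
            # reference count instead, which is ignored here.
            policy = header.group(2).split()
            if len(policy) == 2 and policy[0] == "policy" and policy[1] in BLOCKING:
                found[chain].append("policy " + policy[1])
            continue
        fields = line.split()
        # A rule line starts with its target (ACCEPT, DROP, REJECT, ...).
        if chain is not None and fields and fields[0] in BLOCKING:
            found[chain].append(" ".join(fields))
    return found


# Sample input: the listing from this mail, wrapped lines joined.
LISTING = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.xxx.0/24    state RELATED,ESTABLISHED
ACCEPT     all  --  192.168.xxx.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
ACCEPT     all  --  anywhere             anywhere            PHYSDEV match --physdev-is-bridged

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
"""

result = blocking_rules(LISTING)
for name, rules in result.items():
    print(name, "blocking entries:", len(rules))
```

For this listing, the only blocking entries are the two REJECT rules in the
FORWARD chain, which matches the statement that INPUT and OUTPUT have none.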

BTW, I use Corosync-1.3 on RHEL6 with a 12-node cluster.
Does anyone have a good track record with Corosync-1.3 + RHEL6 on a
large-scale cluster?

Best Regards,

(2011/02/22 3:08), Steven Dake wrote:
> Your firewall may be enabled for the ports corosync uses to communicate.
> Newer versions of corosync have a diag that tells the user this may be
> a problem for them.
>
> The firewall needs to be configured properly if it is enabled (which it
> is by default in RHEL/FEDORA).  In a rhel environment, this can be done
> via system->preferences->firewall GUI or adding your own iptables rules.
>
> Regards
> -steve
>
> On 02/21/2011 12:56 AM, NAKAHIRA Kazutomo wrote:
>> Hi, all
>>
>> # This problem is related to the following previous thread, and we use the
>> same test environment.
>> https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html
>>
>> The start process of corosync fell into an infinite loop
>> in my test environment.
>>
>> The corosync process wrote many of the following log lines to the debug
>> logfile, and the start-up process stalled.
>>
>> -- ha-debug --
>> Feb 21 15:39:46 node1 corosync[19268]:   [TOTEM ] totemsrp.c:1852 entering GATHER state from 11.
>> -- ha-debug --
>>
>> It seems that all nodes are sending a lot of "join messages" and
>> they have no way out of the GATHER state.
>>
>> Is this loop expected behavior?
>>
>> The backtrace of the corosync process is as follows:
>> (gdb) bt
>> #0  0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
>> #1  0x00000031bdcda904 in usleep () from /lib64/libc.so.6
>> #2  0x000000351ae11245 in memb_join_message_send (instance=0x7f81483aa010)
>>      at totemsrp.c:2959
>> #3  0x000000351ae13aeb in memb_state_gather_enter (instance=0x7f81483aa010,
>>      gather_from=11) at totemsrp.c:1815
>> #4  0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010,
>>      memb_join=0x232e6c8) at totemsrp.c:3997
>> #5  0x000000351ae175a9 in message_handler_memb_join (instance=0x7f81483aa010,
>>      msg=<value optimized out>, msg_len=<value optimized out>,
>>      endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
>> #6  0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0, msg=0x232e6c8,
>>      msg_len=596) at totemrrp.c:1511
>> #7  0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>,
>>      fd=<value optimized out>, revents=<value optimized out>, data=0x232e020)
>>      at totemudp.c:1244
>> #8  0x000000351ae07202 in poll_run (handle=1265737887312248832)
>>      at coropoll.c:510
>> #9  0x0000000000406cfd in main (argc=<value optimized out>,
>>      argv=<value optimized out>, envp=<value optimized out>) at main.c:1813
>>
>>
>> Our test environment is as follows:
>>   RHEL6(kernel 2.6.32-71.14.1.el6.x86_64)
>>   Corosync-1.3.0-1
>>   Pacemaker-1.0.10-1
>>   cluster-glue-1.0.6-1
>>   resource-agents-1.0.3-1
>>
>>
>> corosync.conf is as follows:
>> -- corosync.conf --
>> compatibility: whitetank
>>
>> aisexec {
>>      user: root
>>      group: root
>> }
>>
>> service {
>>      name: pacemaker
>>      ver: 0
>> }
>>
>> totem {
>>      version: 2
>>      secauth: off
>>      rrp_mode: active
>>      token: 16000
>>      consensus: 20000
>>      clear_node_high_bit: yes
>>      rrp_problem_count_timeout: 30000
>>      fail_recv_const: 50
>>      send_join: 10
>>      interface {
>>          ringnumber: 0
>>          bindnetaddr: AAA.BBB.xxx.0
>>          mcastaddr: 226.94.1.1
>>          mcastport: 5405
>>      }
>>      interface {
>>          ringnumber: 1
>>          bindnetaddr: AAA.BBB.yyy.0
>>          mcastaddr: 226.94.1.1
>>          mcastport: 5405
>>      }
>> }
>>
>> logging {
>>      fileline: on
>>      to_syslog: yes
>>      syslog_facility: local1
>>      syslog_priority: info
>>      debug: on
>>      timestamp: on
>> }
>> -- corosync.conf --
>>
>> We tried "fail_recv_const: 5000" and it reduced the incidence of the problem,
>> but the corosync start-up problem still occurs.
>>
>> If "send_join: 10" is not set, the large number of multicast packets
>> congests the network, and other network communications are blocked.
>>
>>
>> Best Regards,
>>
>

-- 
NAKAHIRA Kazutomo
Infrastructure Software Technology Unit
NTT Open Source Software Center
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
