(2011/02/23 2:43), Steven Dake wrote:
> On 02/22/2011 04:47 AM, NAKAHIRA Kazutomo wrote:
>> Hi, Steven
>>
>> Thank you for your speedy response.
>>
>> I use iptables, but it has no DROP/REJECT rules
>> for the INPUT and OUTPUT chains.
>>
>> My iptables setting is below:
>>
>> [root@test1 ~]# iptables -L
>> Chain INPUT (policy ACCEPT)
>> target     prot opt source               destination
>> ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
>> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
>> ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
>> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps
>>
>> Chain FORWARD (policy ACCEPT)
>> target     prot opt source               destination
>> ACCEPT     all  --  anywhere             192.168.xxx.0/24    state
>> RELATED,ESTABLISHED
>> ACCEPT     all  --  192.168.xxx.0/24     anywhere
>> ACCEPT     all  --  anywhere             anywhere
>> REJECT     all  --  anywhere             anywhere            reject-with
>> icmp-port-unreachable
>> REJECT     all  --  anywhere             anywhere            reject-with
>> icmp-port-unreachable
>> ACCEPT     all  --  anywhere             anywhere            PHYSDEV
>> match --physdev-is-bridged
>>
>> Chain OUTPUT (policy ACCEPT)
>> target     prot opt source               destination
>>
>> Are there any problems?
>>
>> BTW, I use Corosync-1.3 on RHEL6 with a 12-node cluster.
>> Does anyone have a good track record of Corosync-1.3 + RHEL6 with a
>> large-scale cluster?
>>
>
> Are you running on bare metal equipment?  Or in a VM?  If running in a
> virtual machine, iptables should also be configured properly on the VM host.

I am running on 12 bare metal nodes connected through an L3 switch.

I tested the same RPMs and settings in a 12-VM (KVM) environment,
and no problem occurred.
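
For example, one rough way to check whether the L3 switch actually forwards
the corosync multicast between the bare metal nodes is to capture it on each
ring while corosync starts (eth0/eth1 are placeholders for the ring0/ring1
interfaces):

-- multicast capture (sketch) --
# ring0: watch for the totem multicast address/port from corosync.conf
tcpdump -ni eth0 host 226.94.1.1 and udp port 5405
# ring1: the same check on the second interface
tcpdump -ni eth1 host 226.94.1.1 and udp port 5405
-- multicast capture --

If packets from some nodes never show up on the others, IGMP snooping on the
L3 switch (without an active querier) would be one thing to check, since a KVM
bridge usually just floods multicast and would hide that kind of problem.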

> There are significant deployments with corosync-1.2.3 shipped in RHEL6,
> which is similar to 1.3.0.
>
> I am a little stumped.  A few things you can try:
>
> turn off selinux
> turn off iptables (service iptables stop)

SELinux is already disabled.
I turned iptables off, but the problem still occurred.
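
For reference, when iptables is turned back on, rules along the following
lines should be enough for the corosync traffic (a sketch; 5405 is the
mcastport from our corosync.conf, and corosync also uses mcastport - 1,
i.e. 5404):

-- iptables rules (sketch) --
# allow the totem UDP ports on both rings
iptables -I INPUT -p udp -m udp --dport 5404:5405 -j ACCEPT
# and make sure multicast itself is not filtered
iptables -I INPUT -m pkttype --pkt-type multicast -j ACCEPT
-- iptables rules --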

I think the root cause of the problem is some sort of network error
(maybe multicast packets are dropped somewhere).
Even so, corosync falling into the GATHER loop and flooding the network
with multicast packets is a serious problem.

The expected behavior is that faulty nodes are added to
the failed_list and consensus is reached.
However, each node's failed_list remains mismatched indefinitely
and consensus is never reached.
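
To narrow down the suspicion that multicast is being dropped, a tool like
omping (if available), run on every node while corosync is stopped, can test
multicast delivery between all 12 nodes on each ring; a sketch with
placeholder hostnames:

-- omping test (sketch, run on every node) --
# ring0 hostnames of all 12 nodes (placeholders), using the totem
# multicast address and port from corosync.conf:
omping -m 226.94.1.1 -p 5405 \
    node01 node02 node03 node04 node05 node06 \
    node07 node08 node09 node10 node11 node12
# repeat with the ring1 hostnames for the second interface
-- omping test --

omping reports unicast and multicast responses separately, which should make
it easy to see whether only the multicast path is broken.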

Best Regards,

> And see if either of those solve the problem.
>
> Since it sounds like you're building corosync on your own, try the
> pre-built binaries and see if those work.
>
> Regards
> -steve
>
>> Best Regards,
>>
>> (2011/02/22 3:08), Steven Dake wrote:
>>> Your firewall may be enabled for the ports corosync uses to communicate.
>>> Newer versions of corosync have a diagnostic that tells the user this may
>>> be a problem for them.
>>>
>>> The firewall needs to be configured properly if it is enabled (which it
>>> is by default in RHEL/Fedora).  In a RHEL environment, this can be done
>>> via the System->Preferences->Firewall GUI or by adding your own iptables
>>> rules.
>>>
>>> Regards
>>> -steve
>>>
>>> On 02/21/2011 12:56 AM, NAKAHIRA Kazutomo wrote:
>>>> Hi, all
>>>>
>>>> # This problem is related to the following previous thread, and we use the
>>>> same test environment.
>>>> https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html
>>>>
>>>>
>>>> The corosync start-up process fell into an infinite loop
>>>> in my test environment.
>>>>
>>>> The corosync process wrote a lot of the following log messages to the
>>>> debug logfile and the start-up process stalled.
>>>>
>>>> -- ha-debug --
>>>> Feb 21 15:39:46 node1 corosync[19268]:   [TOTEM ] totemsrp.c:1852
>>>> entering GATHER state from 11.
>>>> -- ha-debug --
>>>>
>>>> It seems that all nodes are sending a lot of "join messages" and
>>>> have no way out of the GATHER state.
>>>>
>>>> Is this loop expected behavior?
>>>>
>>>> The backtrace of the corosync process is:
>>>> (gdb) bt
>>>> #0  0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
>>>> #1  0x00000031bdcda904 in usleep () from /lib64/libc.so.6
>>>> #2  0x000000351ae11245 in memb_join_message_send
>>>> (instance=0x7f81483aa010)
>>>>       at totemsrp.c:2959
>>>> #3  0x000000351ae13aeb in memb_state_gather_enter
>>>> (instance=0x7f81483aa010,
>>>>       gather_from=11) at totemsrp.c:1815
>>>> #4  0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010,
>>>>       memb_join=0x232e6c8) at totemsrp.c:3997
>>>> #5  0x000000351ae175a9 in message_handler_memb_join
>>>> (instance=0x7f81483aa010,
>>>>       msg=<value optimized out>, msg_len=<value optimized out>,
>>>>       endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
>>>> #6  0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0,
>>>> msg=0x232e6c8,
>>>>       msg_len=596) at totemrrp.c:1511
>>>> #7  0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>,
>>>>       fd=<value optimized out>, revents=<value optimized out>,
>>>> data=0x232e020)
>>>>       at totemudp.c:1244
>>>> #8  0x000000351ae07202 in poll_run (handle=1265737887312248832)
>>>>       at coropoll.c:510
>>>> #9  0x0000000000406cfd in main (argc=<value optimized out>,
>>>>       argv=<value optimized out>, envp=<value optimized out>) at
>>>> main.c:1813
>>>>
>>>>
>>>> Our test environment is:
>>>>    RHEL6(kernel 2.6.32-71.14.1.el6.x86_64)
>>>>    Corosync-1.3.0-1
>>>>    Pacemaker-1.0.10-1
>>>>    cluster-glue-1.0.6-1
>>>>    resource-agents-1.0.3-1
>>>>
>>>>
>>>> corosync.conf is:
>>>> -- corosync.conf --
>>>> compatibility: whitetank
>>>>
>>>> aisexec {
>>>>       user: root
>>>>       group: root
>>>> }
>>>>
>>>> service {
>>>>       name: pacemaker
>>>>       ver: 0
>>>> }
>>>>
>>>> totem {
>>>>       version: 2
>>>>       secauth: off
>>>>       rrp_mode: active
>>>>       token: 16000
>>>>       consensus: 20000
>>>>       clear_node_high_bit: yes
>>>>       rrp_problem_count_timeout: 30000
>>>>       fail_recv_const: 50
>>>>       send_join: 10
>>>>       interface {
>>>>           ringnumber: 0
>>>>           bindnetaddr: AAA.BBB.xxx.0
>>>>           mcastaddr: 226.94.1.1
>>>>           mcastport: 5405
>>>>       }
>>>>       interface {
>>>>           ringnumber: 1
>>>>           bindnetaddr: AAA.BBB.yyy.0
>>>>           mcastaddr: 226.94.1.1
>>>>           mcastport: 5405
>>>>       }
>>>> }
>>>>
>>>> logging {
>>>>       fileline: on
>>>>       to_syslog: yes
>>>>       syslog_facility: local1
>>>>       syslog_priority: info
>>>>       debug: on
>>>>       timestamp: on
>>>> }
>>>> -- corosync.conf --
>>>>
>>>> We tried "fail_recv_const: 5000" and it lighten incidence of problem,
>>>> But corosync start-up problem keeps being generated now.
>>>>
>>>> If "send_join: 10" is not set, a lot of multicast packet causes crowding
>>>> the network and other network communications are blocked.
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>
>>
>
>


-- 
NAKAHIRA Kazutomo
Infrastructure Software Technology Unit
NTT Open Source Software Center
