Hi all,

This problem is related to the following previous thread, and we use
the same test environment:
https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html

The start-up process of corosync falls into an infinite loop
in my test environment.

The corosync process writes a large number of the following log messages
to the debug logfile, and the start-up process stalls.

-- ha-debug --
Feb 21 15:39:46 node1 corosync[19268]:   [TOTEM ] totemsrp.c:1852
entering GATHER state from 11.
-- ha-debug --

It seems that all nodes are sending many "join messages" and
have no way out of the GATHER state.

Is this loop expected behavior?

The backtrace of the corosync process is:
(gdb) bt
#0  0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
#1  0x00000031bdcda904 in usleep () from /lib64/libc.so.6
#2  0x000000351ae11245 in memb_join_message_send (instance=0x7f81483aa010)
    at totemsrp.c:2959
#3  0x000000351ae13aeb in memb_state_gather_enter (instance=0x7f81483aa010,
    gather_from=11) at totemsrp.c:1815
#4  0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010,
    memb_join=0x232e6c8) at totemsrp.c:3997
#5  0x000000351ae175a9 in message_handler_memb_join
(instance=0x7f81483aa010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
#6  0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0, msg=0x232e6c8,
    msg_len=596) at totemrrp.c:1511
#7  0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x232e020)
    at totemudp.c:1244
#8  0x000000351ae07202 in poll_run (handle=1265737887312248832)
    at coropoll.c:510
#9  0x0000000000406cfd in main (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>) at main.c:1813


Our test environment is:
 RHEL6(kernel 2.6.32-71.14.1.el6.x86_64)
 Corosync-1.3.0-1
 Pacemaker-1.0.10-1
 cluster-glue-1.0.6-1
 resource-agents-1.0.3-1


Our corosync.conf is:
-- corosync.conf --
compatibility: whitetank

aisexec {
    user: root
    group: root
}

service {
    name: pacemaker
    ver: 0
}

totem {
    version: 2
    secauth: off
    rrp_mode: active
    token: 16000
    consensus: 20000
    clear_node_high_bit: yes
    rrp_problem_count_timeout: 30000
    fail_recv_const: 50
    send_join: 10
    interface {
        ringnumber: 0
        bindnetaddr: AAA.BBB.xxx.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: AAA.BBB.yyy.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

logging {
    fileline: on
    to_syslog: yes
    syslog_facility: local1
    syslog_priority: info
    debug: on
    timestamp: on
}
-- corosync.conf --

We tried "fail_recv_const: 5000", which lowered the incidence of the
problem, but the corosync start-up problem still occurs.

If "send_join: 10" is not set, the large number of multicast packets
crowds the network and other network communications are blocked.
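For what it is worth, one tuning direction we are considering (the
values below are illustrative guesses only, not yet tested in our
environment) is to keep send_join but enlarge the join timeout so that
the randomized send delay stays well below it:

-- corosync.conf (fragment, illustrative) --
totem {
    # random 0-10 ms delay before each join message (already in our config)
    send_join: 10
    # join timeout in ms (illustrative value; the default is 50)
    join: 1000
}
-- corosync.conf (fragment, illustrative) --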


Best Regards,

-- 
NAKAHIRA Kazutomo
Infrastructure Software Technology Unit
NTT Open Source Software Center
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais