Hi all,
This problem is related to the following previous thread, and we use the same
test environment:
https://lists.linux-foundation.org/pipermail/openais/2011-February/015673.html
The corosync start-up process fell into an infinite loop
in my test environment.
The corosync process wrote the following log message repeatedly to the
debug logfile, and the start-up process stalled.
-- ha-debug --
Feb 21 15:39:46 node1 corosync[19268]: [TOTEM ] totemsrp.c:1852
entering GATHER state from 11.
-- ha-debug --
It seems that all nodes are sending a large number of join messages and
have no way out of the GATHER state.
Is this loop expected behavior?
The backtrace of the corosync process is:
(gdb) bt
#0 0x00000031bdca6a8d in nanosleep () from /lib64/libc.so.6
#1 0x00000031bdcda904 in usleep () from /lib64/libc.so.6
#2 0x000000351ae11245 in memb_join_message_send (instance=0x7f81483aa010)
at totemsrp.c:2959
#3 0x000000351ae13aeb in memb_state_gather_enter (instance=0x7f81483aa010,
gather_from=11) at totemsrp.c:1815
#4 0x000000351ae16e22 in memb_join_process (instance=0x7f81483aa010,
memb_join=0x232e6c8) at totemsrp.c:3997
#5 0x000000351ae175a9 in message_handler_memb_join
(instance=0x7f81483aa010,
msg=<value optimized out>, msg_len=<value optimized out>,
endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
#6 0x000000351ae0e9a4 in rrp_deliver_fn (context=0x23022e0, msg=0x232e6c8,
msg_len=596) at totemrrp.c:1511
#7 0x000000351ae0b4d6 in net_deliver_fn (handle=<value optimized out>,
fd=<value optimized out>, revents=<value optimized out>, data=0x232e020)
at totemudp.c:1244
#8 0x000000351ae07202 in poll_run (handle=1265737887312248832)
at coropoll.c:510
#9 0x0000000000406cfd in main (argc=<value optimized out>,
argv=<value optimized out>, envp=<value optimized out>) at main.c:1813
Our test environment is:
RHEL6(kernel 2.6.32-71.14.1.el6.x86_64)
Corosync-1.3.0-1
Pacemaker-1.0.10-1
cluster-glue-1.0.6-1
resource-agents-1.0.3-1
Our corosync.conf is:
-- corosync.conf --
compatibility: whitetank

aisexec {
        user: root
        group: root
}

service {
        name: pacemaker
        ver: 0
}

totem {
        version: 2
        secauth: off
        rrp_mode: active
        token: 16000
        consensus: 20000
        clear_node_high_bit: yes
        rrp_problem_count_timeout: 30000
        fail_recv_const: 50
        send_join: 10
        interface {
                ringnumber: 0
                bindnetaddr: AAA.BBB.xxx.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: AAA.BBB.yyy.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        fileline: on
        to_syslog: yes
        syslog_facility: local1
        syslog_priority: info
        debug: on
        timestamp: on
}
-- corosync.conf --
We tried "fail_recv_const: 5000", which reduced the incidence of the problem,
but the corosync start-up problem still occurs.
If "send_join: 10" is not set, the flood of multicast packets congests
the network and blocks other network communication.
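To be concrete, the only tuning we changed from the configuration above was
this one value in the totem section:

-- corosync.conf (tuning tried) --
totem {
        fail_recv_const: 5000
}
-- corosync.conf (tuning tried) --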
Best Regards,
--
NAKAHIRA Kazutomo
Infrastructure Software Technology Unit
NTT Open Source Software Center
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais