Hi Steven, hi all.
I often see the assert below on one of the nodes after I stop corosync on
another node in a newly set-up 4-node cluster. The backtrace is:
#0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6
#1 0x00007f51953e6185 in abort () from /lib64/libc.so.6
#2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
#3 0x00007f5196176406 in memb_consensus_agreed (instance=0x7f5196554010) at totemsrp.c:1194
#4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010, memb_join=0x262f628) at totemsrp.c:3918
#5 0x00007f519617b619 in message_handler_memb_join (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
#6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030, iface_no=0, context=<value optimized out>, msg=<value optimized out>, msg_len=<value optimized out>) at totemrrp.c:720
#7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>, msg=0x262f628, msg_len=420) at totemrrp.c:1404
#8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80) at totemudp.c:1244
#9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at coropoll.c:510
#10 0x0000000000406add in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1680
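Frame #3 is the consensus check in memb_consensus_agreed(). As far as I can tell from reading the 1.2.x totemsrp.c, the assert that fires is the one requiring at least one non-failed processor to be left after the failed list is subtracted from the processor list. Paraphrased from memory (not a verbatim copy of the source, and my line numbers may be slightly off):

    /* Rough paraphrase of memb_consensus_agreed() in totemsrp.c,
     * corosync 1.2.x -- written from memory, not copied verbatim. */
    static int memb_consensus_agreed (struct totemsrp_instance *instance)
    {
            struct srp_addr token_memb[PROCESSOR_COUNT_MAX];
            int token_memb_entries = 0;
            int agreed = 1;
            int i;

            /* token_memb = my_proc_list minus my_failed_list */
            memb_set_subtract (token_memb, &token_memb_entries,
                    instance->my_proc_list, instance->my_proc_list_entries,
                    instance->my_failed_list, instance->my_failed_list_entries);

            /* agreement only if every remaining processor has given consensus */
            for (i = 0; i < token_memb_entries; i++) {
                    if (memb_consensus_isset (instance, &token_memb[i]) == 0) {
                            agreed = 0;
                            break;
                    }
            }

            /* The frame #3 assert (totemsrp.c:1194 here): abort if the failed
             * list has swallowed every known processor, the local node included. */
            assert (token_memb_entries >= 1);

            return (agreed);
    }

So, if I read this right, at the moment that JOIN message was processed the node had put every processor from its proc list, itself included, onto its failed list.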
The last corosync-fplay (flight-recorder replay) lines are:
rec=[36124] Log Message=Delivering MCAST message with seq 1366 to pending delivery queue
rec=[36125] Log Message=Delivering MCAST message with seq 1367 to pending delivery queue
rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
rec=[36130] Log Message=releasing messages up to and including 1367
rec=[36131] Log Message=FAILED TO RECEIVE
rec=[36132] Log Message=entering GATHER state from 6.
rec=[36133] Log Message=entering GATHER state from 0.
Finishing replay: records found [33993]
What could be the reason for this? A bug, the switches, memory errors?
Setup is:
corosync-1.2.8
openais-1.1.4
pacemaker-1.1.4
corosync.conf is:
=============
compatibility: none

totem {
        version: 2
        secauth: off
        # 9192-18
        net_mtu: 9174
        window_size: 300
        max_messages: 25
        rrp_mode: passive

        interface {
                ringnumber: 0
                bindnetaddr: 10.5.4.48
                mcastaddr: 239.94.1.3
                mcastport: 5405
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        debug: off
        timestamp: on

        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

aisexec {
        user: root
        group: root
}
========
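One detail that may matter here: rrp_mode is set to passive, but only ring 0 is configured; there is no second interface block. If a second ring were ever added, it would just be another block of the same shape, something like the following (purely hypothetical addresses, NOT part of the current config):

        interface {
                ringnumber: 1
                bindnetaddr: 192.168.10.0     # hypothetical second network
                mcastaddr: 239.94.1.4         # hypothetical
                mcastport: 5405
        }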
Pacemaker is run with MCP:
service {
        name: pacemaker
        ver: 1
}
The current nodes have addresses 10.5.4.52 through 10.5.4.55.
I also use dlm, gfs2 and clvm as openais clients managed by Pacemaker; they
are the only services configured in Pacemaker right now (apart from fencing).
I verified that iptables does not block anything: I log all denied
packets, and the logs are clean.
I also use bonding in 802.3ad mode with a Cisco 3750-X stack (the physical
interfaces are connected to different switches in the stack). A bridge is
set up on top of the bond.
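To make the network stacking concrete, it is roughly ethX -> bond0 (802.3ad) -> br0, with the cluster address on the bridge. A RHEL-style ifcfg sketch of that layout (device names and the netmask here are examples, not necessarily the exact values on the nodes):

        # ifcfg-eth0 (and likewise eth1), enslaved to the bond
        DEVICE=eth0
        MASTER=bond0
        SLAVE=yes
        ONBOOT=yes

        # ifcfg-bond0, LACP bond carrying the bridge
        DEVICE=bond0
        BONDING_OPTS="mode=802.3ad miimon=100"
        BRIDGE=br0
        ONBOOT=yes

        # ifcfg-br0, bridge holding the node address
        DEVICE=br0
        TYPE=Bridge
        IPADDR=10.5.4.52
        NETMASK=255.255.255.240
        ONBOOT=yes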
What else do I need to provide to help resolve this issue?
Best,
Vladislav