Hi,
On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
> Hi Steven, hi all.
>
> I often see this assert on one of nodes after I stop corosync on some
> another node in newly-setup 4-node cluster.
>
> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6
> #2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> #3 0x00007f5196176406 in memb_consensus_agreed
> (instance=0x7f5196554010) at totemsrp.c:1194
> #4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> memb_join=0x262f628) at totemsrp.c:3918
> #5 0x00007f519617b619 in message_handler_memb_join
> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> optimized out>, endian_conversion_needed=<value optimized out>)
> at totemsrp.c:4161
> #6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> msg_len=<value optimized out>) at totemrrp.c:720
> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> at totemudp.c:1244
> #9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> coropoll.c:510
> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> optimized out>, envp=<value optimized out>) at main.c:1680
I also got a report which looks very similar. It's a 2-node
cluster and the issue reproduces very often: whenever the lost
node reappears and the surviving node re-enters the OPERATIONAL
state. The corosync version there is 1.2.6. The assertion and
the interesting frames of the backtrace follow:
0x663ba0 "corosync: totemsrp.c:1194: memb_consensus_agreed: Assertion
`token_memb_entries >= 1' failed.\n"
#3 0x00007f0756d7099f in memb_consensus_agreed (instance=0x7f0757143010)
at totemsrp.c:1194
token_memb = {{addr = {{nodeid = 1461143782, family = 32519, addr =
"\000\000\375\b\327V\a\177\000\000\320\037~\033\377\177"}, {nodeid =
530579456, family = 7038, addr =
"\377\177\000\000T\272\024W\002\000\000\000\001\000\000"}}}, {addr = {{
nodeid = 2, family = 65193, addr =
"\002\f\002\000\251\376\002\f\b\000\002\000\251\376\002\f"}, {nodeid =
262152, family = 65193, addr =
... [tons of similar with apparently random nodeids, corruption?]
...
token_memb_entries = 0
...
#4 0x00007f0756d76d2e in memb_join_process (instance=0x7f0757143010,
memb_join=0x7f074c025bfc) at totemsrp.c:3931
failed_list = 0x7f074c025c98
fail_minus_memb_entries = 0
fail_minus_memb = {{addr = {{nodeid = 1456860480, family = 32519,
... [again many funny structs]
#5 0x00007f0756d773cc in message_handler_memb_join (instance=0x7f0757143010,
msg=<value optimized out>, msg_len=<value optimized out>,
endian_conversion_needed=<value optimized out>) at totemsrp.c:4174
#6 0x00007f0756d6d2c5 in passive_mcast_recv (rrp_instance=0x7f074c021040,
iface_no=0, context=0x1800, msg=0xffffffffffffffff, msg_len=4294967295)
at totemrrp.c:720
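Looking at the asserting function: as far as I can tell from
totemsrp.c, token_memb is my_proc_list minus my_failed_list, so
token_memb_entries can only end up 0 if the proc list is empty or
completely covered by the failed list, local node included, neither
of which should ever happen. A self-contained toy model of that
subtraction (set_subtract() and the sample lists are illustrative,
not corosync code):

/*
 * Toy model of the check behind the assert, not corosync source.
 * token_memb = proc_list minus failed_list; totemsrp.c:1194 demands
 * that at least one member (the local node) survives the subtraction.
 */
#include <assert.h>
#include <stdio.h>

#define MAX_MEMB 16

/* out = a minus b, returns the number of surviving entries */
static int set_subtract(const unsigned int *a, int a_n,
                        const unsigned int *b, int b_n,
                        unsigned int *out)
{
        int i, j, n = 0;

        for (i = 0; i < a_n; i++) {
                int found = 0;
                for (j = 0; j < b_n; j++) {
                        if (a[i] == b[j])
                                found = 1;
                }
                if (!found)
                        out[n++] = a[i];
        }
        return n;
}

int main(void)
{
        /* local node is nodeid 1, nodeid 2 is the peer that went away */
        unsigned int proc_list[]   = { 1, 2 };
        /* if the failed list somehow grows to cover the whole proc
         * list, local node included ... */
        unsigned int failed_list[] = { 2, 1 };
        unsigned int token_memb[MAX_MEMB];

        int token_memb_entries = set_subtract(proc_list, 2,
                                              failed_list, 2, token_memb);

        printf("token_memb_entries = %d\n", token_memb_entries);

        /* ... and this is the condition totemsrp.c:1194 refuses to accept */
        assert(token_memb_entries >= 1);
        return 0;
}

Side note: with token_memb_entries at 0 the token_memb array in frame
#3 is just uninitialized stack, so the random-looking nodeids in it
don't necessarily prove memory corruption.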
> Last fplay lines are:
>
> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> pending delivery queue
> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> pending delivery queue
> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36130] Log Message=releasing messages up to and including 1367
> rec=[36131] Log Message=FAILED TO RECEIVE
> rec=[36132] Log Message=entering GATHER state from 6.
> rec=[36133] Log Message=entering GATHER state from 0.
> Finishing replay: records found [33993]
The last fplay records here also look similar:
rec=[6968] Log Message=info: send_member_notification: Sending membership
update 196 to 2 children
rec=[6969] ENTERING function [amf_confchg_fn] line [1344]
rec=[6970] Log Message=This node is within the primary component and will
provide service.
rec=[6971] Log Message=entering OPERATIONAL state.
rec=[6972] Log Message=A processor joined or left the membership and a new
membership was formed.
....
rec=[7069] Log Message=Delivering 0 to 11
rec=[7070] Log Message=Delivering 0 to 11
rec=[7071] Log Message=mcasted message added to pending queue
rec=[7072] Log Message=mcasted message added to pending queue
rec=[7073] Log Message=Delivering 0 to 13
rec=[7074] Log Message=Received ringid(169.254.2.11:196) seq 13
rec=[7075] Log Message=Delivering 0 to 13
rec=[7076] Log Message=Received ringid(169.254.2.11:196) seq 12
rec=[7077] Log Message=Delivering 0 to 13
rec=[7078] Log Message=Delivering 0 to 13
rec=[7079] Log Message=FAILED TO RECEIVE
rec=[7080] Log Message=entering GATHER state from 6.
rec=[7081] Log Message=mcasted message added to pending queue
rec=[7082] Log Message=mcasted message added to pending queue
rec=[7083] Log Message=entering GATHER state from 0.
Finishing replay: records found [7083]
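The common pattern in both replays seems to be: FAILED TO RECEIVE,
then GATHER, and the assert then fires while the memb_join messages
for the new ring are processed (the memb_join_process frame in both
backtraces). If I remember the totemsrp logic right, FAILED TO
RECEIVE means the token kept rotating while the node stayed stuck
missing multicast messages it never managed to recover, until the
fail_recv_const threshold was exceeded. A toy model of that counter
(names and the threshold are from memory / illustrative, not the
real code):

/*
 * Toy model of the failed-to-receive counter, not corosync source.
 */
#include <stdio.h>

#define FAIL_RECV_CONST 2500    /* the fail_recv_const option; 2500 is the default IIRC */

struct node_state {
        unsigned int my_aru;       /* highest seq delivered in order */
        unsigned int last_aru;     /* my_aru when the token last passed */
        unsigned int my_aru_count; /* consecutive rotations without progress */
};

/* called once per token rotation */
static void on_token(struct node_state *n, unsigned int ring_high_seq)
{
        if (n->my_aru == n->last_aru && n->my_aru < ring_high_seq) {
                /* still behind the ring and no retransmit arrived */
                n->my_aru_count++;
        } else {
                n->my_aru_count = 0;
        }
        n->last_aru = n->my_aru;

        if (n->my_aru_count > FAIL_RECV_CONST) {
                printf("FAILED TO RECEIVE\n");
                printf("entering GATHER state from 6.\n");
                /* the real code then abandons the ring and starts
                 * exchanging memb_join messages again, which is where
                 * both backtraces end up asserting */
        }
}

int main(void)
{
        /* stuck at seq 10 while the ring is already at 13 */
        struct node_state n = { 10, 10, FAIL_RECV_CONST };

        on_token(&n, 13);
        return 0;
}

Whether the FAILED TO RECEIVE itself is just a symptom of the
membership change or the actual trigger, I can't tell.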
> What could be the reason for this? Bug, switches, memory errors?
>
> Setup is:
> corosync-1.2.8
> openais-1.1.4
> pacemaker-1.1.4
>
> corosync.conf is:
> =============
> compatibility: none
>
> totem {
> version: 2
> secauth: off
> # 9192-18
> net_mtu: 9174
> window_size: 300
> max_messages: 25
> rrp_mode: passive
>
> interface {
> ringnumber: 0
> bindnetaddr: 10.5.4.48
> mcastaddr: 239.94.1.3
> mcastport: 5405
> }
> }
The totem section of corosync.conf is below (one difference is that
passive rrp is configured with two rings here):
totem {
rrp_mode: passive
token_retransmits_before_loss_const: 10
join: 60
max_messages: 20
vsftype: none
token: 3000
consensus: 4000
secauth: on
version: 2
threads: 24
interface {
bindnetaddr: 169.254.2.0
mcastaddr: 239.254.2.248
mcastport: 5405
ringnumber: 0
}
interface {
bindnetaddr: 169.254.12.0
mcastaddr: 239.254.12.248
mcastport: 5407
ringnumber: 1
}
clear_node_high_bit: yes
}
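FWIW, the timer values look sane to me against what I remember from
corosync.conf(5) (consensus should be at least 1.2 * token):

    1.2 * token = 1.2 * 3000 ms = 3600 ms <= consensus = 4000 ms

so I don't think the timings themselves are to blame.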
> logging {
> fileline: off
> to_stderr: no
> to_logfile: no
> to_syslog: yes
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> aisexec {
> user: root
> group: root
> }
> ========
>
> Pacemaker is run with MCP:
> service {
> name: pacemaker
> ver: 1
> }
>
> Current nodes have addresses 10.5.4.52 to 55.
>
> I also use dlm, gfs2 and clvm as openais clients from pacemaker and they
> are the only services configured in pacemaker right now (except fencing).
>
> I verified that iptables do not block anything - I log all denied
> packets, and logs are clean.
>
> I also use bonding in 802.3ad mode with cisco 3750x stack (physical
> interfaces are connected to different switch in stack). Bridge is set up
> on top of bonding.
>
> What more do I need to provide to help with resolving this issue?
More information is available on request here too.
Thanks,
Dejan
> Best,
> Vladislav
>
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais