Hi,
On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
> Hi Steven, hi all.
>
> I often see this assert on one of the nodes after I stop corosync on
> another node in a newly set up 4-node cluster.
Does the assert happen on a node-lost event, or once a new
partition is formed?
> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6
> #2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> #3 0x00007f5196176406 in memb_consensus_agreed
> (instance=0x7f5196554010) at totemsrp.c:1194
> #4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> memb_join=0x262f628) at totemsrp.c:3918
> #5 0x00007f519617b619 in message_handler_memb_join
> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> optimized out>, endian_conversion_needed=<value optimized out>)
> at totemsrp.c:4161
> #6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> msg_len=<value optimized out>) at totemrrp.c:720
> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> at totemudp.c:1244
> #9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> coropoll.c:510
> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> optimized out>, envp=<value optimized out>) at main.c:1680
>
> Last fplay lines are:
>
> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> pending delivery queue
> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> pending delivery queue
> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36130] Log Message=releasing messages up to and including 1367
> rec=[36131] Log Message=FAILED TO RECEIVE
> rec=[36132] Log Message=entering GATHER state from 6.
> rec=[36133] Log Message=entering GATHER state from 0.
> Finishing replay: records found [33993]
>
> What could be the reason for this? Bug, switches, memory errors?
The assertion fails because corosync finds that
instance->my_proc_list and instance->my_failed_list are
equal. That happens immediately after the "FAILED TO RECEIVE"
message, which is issued once fail_recv_const token rotations
(the default is 50) have passed without any multicast packet
being received. The node then goes from the OPERATIONAL state
back to the GATHER state. Could it be that no multicast packets
are actually arriving? In one case the "FAILED TO RECEIVE"
happened 26 seconds after the membership stabilized.
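To make that a bit more concrete, roughly the following happens on
every token rotation. This is only an illustrative sketch, not the
actual totemsrp.c code; the type and function names here are made up:

#include <stdio.h>

enum memb_state {
        MEMB_STATE_OPERATIONAL,
        MEMB_STATE_GATHER
};

struct srp_instance {
        enum memb_state state;
        unsigned int empty_rotations;  /* rotations with no multicast seen */
        unsigned int fail_recv_const;  /* configurable; 50 by default here */
};

/* called once per token rotation */
static void on_token_rotation(struct srp_instance *inst, int got_mcast)
{
        if (got_mcast) {
                inst->empty_rotations = 0;
                return;
        }
        if (inst->state == MEMB_STATE_OPERATIONAL &&
            ++inst->empty_rotations >= inst->fail_recv_const) {
                printf("FAILED TO RECEIVE\n");
                inst->state = MEMB_STATE_GATHER; /* re-form the membership */
        }
}

int main(void)
{
        struct srp_instance inst = {
                .state = MEMB_STATE_OPERATIONAL,
                .empty_rotations = 0,
                .fail_recv_const = 50,
        };
        int i;

        /* simulate 50 token rotations without any multicast traffic */
        for (i = 0; i < 50; i++)
                on_token_rotation(&inst, 0);
        return 0;
}

That transition is what the two "entering GATHER state" lines right
after "FAILED TO RECEIVE" in your fplay output correspond to.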
I also noticed that the node which joined the cluster had marked
one of the rings as faulty just before the other node aborted. Did
you notice this as well?
Is there anything else of interest in the logs?
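In case it helps while debugging: fail_recv_const can be set explicitly
in the totem section of corosync.conf. The value below is just an example;
raising it only postpones the detection and does not explain why no
multicast arrived for that long:

totem {
        # keep your existing options (version, rrp_mode, net_mtu, ...)
        fail_recv_const: 50
}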
Thanks,
Dejan
> Setup is:
> corosync-1.2.8
> openais-1.1.4
> pacemaker-1.1.4
>
> corosync.conf is:
> =============
> compatibility: none
>
> totem {
> version: 2
> secauth: off
> # 9192-18
> net_mtu: 9174
> window_size: 300
> max_messages: 25
> rrp_mode: passive
>
> interface {
> ringnumber: 0
> bindnetaddr: 10.5.4.48
> mcastaddr: 239.94.1.3
> mcastport: 5405
> }
> }
> logging {
> fileline: off
> to_stderr: no
> to_logfile: no
> to_syslog: yes
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> aisexec {
> user: root
> group: root
> }
> ========
>
> Pacemaker is run with MCP:
> service {
> name: pacemaker
> ver: 1
> }
>
> Current nodes have addresses 10.5.4.52 to 55.
>
> I also use dlm, gfs2 and clvm as openais clients from pacemaker and they
> are the only services configured in pacemaker right now (except fencing).
>
> I verified that iptables do not block anything - I log all denied
> packets, and logs are clean.
>
> I also use bonding in 802.3ad mode with a Cisco 3750x stack (the physical
> interfaces are connected to different switches in the stack). A bridge is
> set up on top of the bond.
>
> What more do I need to provide to help with resolving this issue?
>
> Best,
> Vladislav
>
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais