Hi Steven, hi all.

I often see this assert on one of the nodes after I stop corosync on
another node in a newly set up 4-node cluster.

#0  0x00007f51953e49a5 in raise () from /lib64/libc.so.6
#1  0x00007f51953e6185 in abort () from /lib64/libc.so.6
#2  0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
#3  0x00007f5196176406 in memb_consensus_agreed
(instance=0x7f5196554010) at totemsrp.c:1194
#4  0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
memb_join=0x262f628) at totemsrp.c:3918
#5  0x00007f519617b619 in message_handler_memb_join
(instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
optimized out>, endian_conversion_needed=<value optimized out>)
    at totemsrp.c:4161
#6  0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
iface_no=0, context=<value optimized out>, msg=<value optimized out>,
msg_len=<value optimized out>) at totemrrp.c:720
#7  0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
msg=0x262f628, msg_len=420) at totemrrp.c:1404
#8  0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
at totemudp.c:1244
#9  0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
coropoll.c:510
#10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
optimized out>, envp=<value optimized out>) at main.c:1680

The last corosync-fplay lines are:

rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
pending delivery queue
rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
pending delivery queue
rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
rec=[36130] Log Message=releasing messages up to and including 1367
rec=[36131] Log Message=FAILED TO RECEIVE
rec=[36132] Log Message=entering GATHER state from 6.
rec=[36133] Log Message=entering GATHER state from 0.
Finishing replay: records found [33993]
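
For context, here is my rough understanding of the FAILED TO RECEIVE
path in the orf-token handler of exec/totemsrp.c (a sketch of the logic
as I read it, not a verbatim quote, so I may well be misreading
something):

        if (instance->my_aru_count >
                instance->totem_config->fail_to_recv_const &&
                token->aru_addr != instance->my_id.addr[0].nodeid) {

                log_printf (instance->totemsrp_log_level_error,
                        "FAILED TO RECEIVE\n");

                ring_state_restore (instance);

                /* reason 6 -> "entering GATHER state from 6" above */
                memb_state_gather_enter (instance, 6);
        }

What I cannot tell is how the failed list can end up covering the whole
proc list after this, which seems to be the only way the assert above
could fire.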

What could be the reason for this? A bug, the switches, memory errors?

Setup is:
corosync-1.2.8
openais-1.1.4
pacemaker-1.1.4

corosync.conf is:
=============
compatibility: none

totem {
    version: 2
    secauth: off
    # interface MTU of 9192 minus 18 bytes of Ethernet header
    netmtu: 9174
    window_size: 300
    max_messages: 25
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 10.5.4.48
        mcastaddr: 239.94.1.3
        mcastport: 5405
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

aisexec {
    user:   root
    group:  root
}
========

Pacemaker is run with MCP:
service {
    name: pacemaker
    ver:  1
}

The current nodes have addresses 10.5.4.52 through 10.5.4.55.

I also use dlm, gfs2 and clvm as openais clients managed from pacemaker;
they are the only services configured in pacemaker right now (apart from
fencing).

I verified that iptables does not block anything: I log all denied
packets, and the logs are clean.

I also use bonding in 802.3ad mode with a Cisco 3750-X stack (the
physical interfaces are connected to different switches in the stack).
A bridge is set up on top of the bond.
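
To help rule the switches out, something like the following minimal
receiver could be run on each node with corosync stopped (this is my own
hypothetical test program, not corosync code; the group and port are the
mcastaddr/mcastport from the config above), while another node sends
datagrams to 239.94.1.3:5405:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main (int argc, char **argv)
{
        const char *group = "239.94.1.3";   /* mcastaddr */
        const char *local = argc > 1 ? argv[1] : "0.0.0.0"; /* e.g. 10.5.4.52 */

        int fd = socket (AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror ("socket");
                return 1;
        }

        int reuse = 1;
        setsockopt (fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));

        struct sockaddr_in addr;
        memset (&addr, 0, sizeof (addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl (INADDR_ANY);
        addr.sin_port = htons (5405);       /* mcastport */
        if (bind (fd, (struct sockaddr *) &addr, sizeof (addr)) < 0) {
                perror ("bind");            /* fails if corosync still runs */
                return 1;
        }

        /* Join the totem multicast group on the given local address. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr (group);
        mreq.imr_interface.s_addr = inet_addr (local);
        if (setsockopt (fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                        &mreq, sizeof (mreq)) < 0) {
                perror ("IP_ADD_MEMBERSHIP");
                return 1;
        }

        char buf[9216];                     /* enough for netmtu 9174 */
        for (;;) {
                ssize_t n = recv (fd, buf, sizeof (buf), 0);
                if (n < 0) {
                        perror ("recv");
                        return 1;
                }
                printf ("received %zd bytes\n", n);
        }
}

If datagrams stop arriving after a few minutes, IGMP snooping or querier
settings on the stack would be my first suspect.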

What else do I need to provide to help resolve this issue?

Best,
Vladislav
