Hi,

On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
> Hi Steven, hi all.
> 
> I often see this assert on one of the nodes after I stop corosync on
> another node in a newly set-up 4-node cluster.
> 
> #0  0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> #1  0x00007f51953e6185 in abort () from /lib64/libc.so.6
> #2  0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> #3  0x00007f5196176406 in memb_consensus_agreed
> (instance=0x7f5196554010) at totemsrp.c:1194
> #4  0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> memb_join=0x262f628) at totemsrp.c:3918
> #5  0x00007f519617b619 in message_handler_memb_join
> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> optimized out>, endian_conversion_needed=<value optimized out>)
>     at totemsrp.c:4161
> #6  0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> msg_len=<value optimized out>) at totemrrp.c:720
> #7  0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> #8  0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> at totemudp.c:1244
> #9  0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> coropoll.c:510
> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> optimized out>, envp=<value optimized out>) at main.c:1680

I also got a report which looks very similar. It's a 2-node
cluster and the issue reproduces very often whenever the lost
node reappears and the other node reenters the operational
state. The corosync version is 1.2.6.

        0x663ba0 "corosync: totemsrp.c:1194: memb_consensus_agreed: Assertion 
`token_memb_entries >= 1' failed.\n"

#3  0x00007f0756d7099f in memb_consensus_agreed (instance=0x7f0757143010)
    at totemsrp.c:1194
        token_memb = {{addr = {{nodeid = 1461143782, family = 32519, addr =
    "\000\000\375\b\327V\a\177\000\000\320\037~\033\377\177"}, {nodeid =
            530579456, family = 7038, addr =
    "\377\177\000\000T\272\024W\002\000\000\000\001\000\000"}}}, {addr = {{
                nodeid = 2, family = 65193, addr =
    "\002\f\002\000\251\376\002\f\b\000\002\000\251\376\002\f"}, {nodeid =
    262152, family = 65193, addr =
        ... [many more entries like these, with apparently random nodeids, corruption?]
        ...
        token_memb_entries = 0
        ...
#4  0x00007f0756d76d2e in memb_join_process (instance=0x7f0757143010, 
    memb_join=0x7f074c025bfc) at totemsrp.c:3931
        failed_list = 0x7f074c025c98
        fail_minus_memb_entries = 0
        fail_minus_memb = {{addr = {{nodeid = 1456860480, family = 32519, 
        ... [again many similar odd-looking structs]
#5  0x00007f0756d773cc in message_handler_memb_join (instance=0x7f0757143010, 
    msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4174
#6  0x00007f0756d6d2c5 in passive_mcast_recv (rrp_instance=0x7f074c021040, 
    iface_no=0, context=0x1800, msg=0xffffffffffffffff, msg_len=4294967295)
    at totemrrp.c:720
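
For what it's worth, here is my reading of what the assert guards, as a
minimal, self-contained C illustration (the nodeids and the simplified
memb_set_and() below are made up for the example; this is not the actual
totemsrp.c code): totemsrp builds the intersection of the local process
list and the current member list into token_memb, and the assert insists
that this intersection is never empty, since the local node should always
be present in both lists.

#include <assert.h>
#include <stdio.h>

#define PROCESSOR_COUNT_MAX 384   /* approximate, for the example only */

/* simplified stand-in for the set-intersection helper: out = a AND b */
static void memb_set_and(const unsigned int *a, int a_n,
                         const unsigned int *b, int b_n,
                         unsigned int *out, int *out_n)
{
        int i, j;

        *out_n = 0;
        for (i = 0; i < a_n; i++) {
                for (j = 0; j < b_n; j++) {
                        if (a[i] == b[j]) {
                                out[(*out_n)++] = a[i];
                                break;
                        }
                }
        }
}

int main(void)
{
        /* hypothetical nodeids; the local node is 1 */
        unsigned int my_proc_list[] = { 1, 2 };
        unsigned int my_memb_list[] = { 1, 2 };
        unsigned int token_memb[PROCESSOR_COUNT_MAX];
        int token_memb_entries = 0;

        memb_set_and(my_proc_list, 2, my_memb_list, 2,
                     token_memb, &token_memb_entries);

        /*
         * Mirrors the assert at totemsrp.c:1194: it can only fire when
         * the two lists have become disjoint, i.e. the local node is no
         * longer present in one of them.
         */
        assert(token_memb_entries >= 1);
        printf("intersection has %d entries\n", token_memb_entries);

        return 0;
}

If that reading is right, then with token_memb_entries == 0 the token_memb
array is never filled, so the "apparently random" nodeids in the dump above
may simply be uninitialized stack rather than real corruption.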

> Last fplay lines are:
> 
> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> pending delivery queue
> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> pending delivery queue
> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36130] Log Message=releasing messages up to and including 1367
> rec=[36131] Log Message=FAILED TO RECEIVE
> rec=[36132] Log Message=entering GATHER state from 6.
> rec=[36133] Log Message=entering GATHER state from 0.
> Finishing replay: records found [33993]

This also looks similar:

rec=[6968] Log Message=info: send_member_notification: Sending membership 
update 196 to 2 children
rec=[6969] ENTERING function [amf_confchg_fn] line [1344]
rec=[6970] Log Message=This node is within the primary component and will 
provide service.
rec=[6971] Log Message=entering OPERATIONAL state.
rec=[6972] Log Message=A processor joined or left the membership and a new 
membership was formed.
....
rec=[7069] Log Message=Delivering 0 to 11
rec=[7070] Log Message=Delivering 0 to 11
rec=[7071] Log Message=mcasted message added to pending queue
rec=[7072] Log Message=mcasted message added to pending queue
rec=[7073] Log Message=Delivering 0 to 13
rec=[7074] Log Message=Received ringid(169.254.2.11:196) seq 13
rec=[7075] Log Message=Delivering 0 to 13
rec=[7076] Log Message=Received ringid(169.254.2.11:196) seq 12
rec=[7077] Log Message=Delivering 0 to 13
rec=[7078] Log Message=Delivering 0 to 13
rec=[7079] Log Message=FAILED TO RECEIVE
rec=[7080] Log Message=entering GATHER state from 6.
rec=[7081] Log Message=mcasted message added to pending queue
rec=[7082] Log Message=mcasted message added to pending queue
rec=[7083] Log Message=entering GATHER state from 0.
Finishing replay: records found [7083]
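
In both replays the crash is immediately preceded by FAILED TO RECEIVE and
the GATHER transitions, which, as far as I understand it, means the token
rotated fail_recv_const times without the multicast messages that should
have been received actually arriving. Neither of our configs sets
fail_recv_const, so the default applies. Purely as a diagnostic experiment
(a sketch only, not a recommended fix), one could raise it in the totem
section and watch whether the FAILED TO RECEIVE / GATHER sequence becomes
less frequent:

totem {
        ...
        # example only: raise the failed-to-receive threshold above the
        # default so that transient multicast loss is tolerated longer
        fail_recv_const: 5000
        ...
}

That would of course only mask whatever makes the messages go missing, and
the assert itself still looks like a bug, but it might help confirm that
the failure-to-receive path is what triggers it.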

> What could be the reason for this? Bug, switches, memory errors?
> 
> Setup is:
> corosync-1.2.8
> openais-1.1.4
> pacemaker-1.1.4
> 
> corosync.conf is:
> =============
> compatibility: none
> 
> totem {
>     version: 2
>     secauth: off
>     # 9192-18
>     net_mtu: 9174
>     window_size: 300
>     max_messages: 25
>     rrp_mode: passive
> 
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.5.4.48
>         mcastaddr: 239.94.1.3
>         mcastport: 5405
>     }
> }

The totem section of corosync.conf (one difference is that rrp_mode
passive is used here with two rings):

totem {
        rrp_mode:       passive
        token_retransmits_before_loss_const:    10
        join:   60
        max_messages:   20
        vsftype:        none
        token:  3000
        consensus:      4000
        secauth:        on
        version:        2
        threads:        24
        interface {
                bindnetaddr:    169.254.2.0
                mcastaddr:         239.254.2.248
                mcastport:         5405
                ringnumber:        0
        }
        interface {
                bindnetaddr:    169.254.12.0
                mcastaddr:         239.254.12.248
                mcastport:         5407
                ringnumber:     1
        }
        clear_node_high_bit:    yes
}

> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: no
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
> 
> amf {
>     mode: disabled
> }
> 
> aisexec {
>     user:   root
>     group:  root
> }
> ========
> 
> Pacemaker is run with MCP:
> service {
>     name: pacemaker
>     ver:  1
> }
> 
> Current nodes have addresses 10.5.4.52 to 55.
> 
> I also use dlm, gfs2 and clvm as openais clients managed by pacemaker, and
> they are the only services configured in pacemaker right now (except fencing).
> 
> I verified that iptables does not block anything - I log all denied
> packets, and the logs are clean.
> 
> I also use bonding in 802.3ad mode with a Cisco 3750x stack (the physical
> interfaces are connected to different switches in the stack). A bridge is
> set up on top of the bond.
> 
> What more do I need to provide to help with resolving this issue?

More information is available on request here too.

Thanks,

Dejan

> Best,
> Vladislav
> 
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
