01.12.2010 16:32, Dejan Muhamedagic wrote:
> Hi,
> 
> On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote:
>> Hi Steven, hi all.
>>
>> I often see this assert on one of the nodes after I stop corosync on
>> another node in a newly set up 4-node cluster.
> 
> Does the assert happen on a node-lost event, or once a new
> partition is formed?

I first noticed it when I rebooted another node, just after the console
said that OpenAIS had stopped.

I can't say right now exactly which event it followed; I'm currently
fighting several problems at once, with corosync, pacemaker, NFS4 and
phantom uncorrectable ECC errors, and I'm a bit lost with all of them.

> 
>> #0  0x00007f51953e49a5 in raise () from /lib64/libc.so.6
>> #1  0x00007f51953e6185 in abort () from /lib64/libc.so.6
>> #2  0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
>> #3  0x00007f5196176406 in memb_consensus_agreed (instance=0x7f5196554010) at totemsrp.c:1194
>> #4  0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010, memb_join=0x262f628) at totemsrp.c:3918
>> #5  0x00007f519617b619 in message_handler_memb_join (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4161
>> #6  0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030, iface_no=0, context=<value optimized out>, msg=<value optimized out>, msg_len=<value optimized out>) at totemrrp.c:720
>> #7  0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>, msg=0x262f628, msg_len=420) at totemrrp.c:1404
>> #8  0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80) at totemudp.c:1244
>> #9  0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at coropoll.c:510
>> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1680
>>
>> Last fplay lines are:
>>
rec=[36124] Log Message=Delivering MCAST message with seq 1366 to pending delivery queue
rec=[36125] Log Message=Delivering MCAST message with seq 1367 to pending delivery queue
>> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
>> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
>> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
>> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
>> rec=[36130] Log Message=releasing messages up to and including 1367
>> rec=[36131] Log Message=FAILED TO RECEIVE
>> rec=[36132] Log Message=entering GATHER state from 6.
>> rec=[36133] Log Message=entering GATHER state from 0.
>> Finishing replay: records found [33993]
>>
>> What could be the reason for this? Bug, switches, memory errors?
> 
> The assertion fails because corosync finds that
> instance->my_proc_list and instance->my_failed_list are
> equal. That happens immediately after the "FAILED TO RECEIVE"
> message, which is issued when fail_recv_const token rotations
> have passed without any multicast packet being received
> (the default is 50).
> 
> The state then goes from the OPERATIONAL state back to the
> GATHER state. Could it be that no packets are actually expected
> to arrive over multicast? In one case the "FAILED TO RECEIVE"
> happened 26 seconds after the membership stabilized.
> 
> What I've also noticed is that the node which joined the cluster
> had marked one of the rings as faulty just before the other node
> aborted. Did you also notice this?
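If I read that right, the consensus check is essentially a set
subtraction: token membership = my_proc_list minus my_failed_list, and
equal lists leave nothing behind. Here is a toy C program (my own
sketch, not the actual totemsrp.c code -- the real memb_set_subtract
and srp_addr structures differ) showing why the assert then has to
fire:

#include <assert.h>
#include <stdio.h>

#define MAX_NODES 16

/* result = full \ failed: keep every id of 'full' not in 'failed' */
static int set_subtract(const int *full, int n_full,
                        const int *failed, int n_failed,
                        int *result)
{
    int n = 0, i, j, in_failed;

    for (i = 0; i < n_full; i++) {
        in_failed = 0;
        for (j = 0; j < n_failed; j++) {
            if (full[i] == failed[j]) {
                in_failed = 1;
                break;
            }
        }
        if (!in_failed)
            result[n++] = full[i];
    }
    return n;
}

int main(void)
{
    /* stand-ins for my_proc_list and my_failed_list; after
     * "FAILED TO RECEIVE" every known node ends up marked failed */
    int proc_list[]   = { 1, 2, 3, 4 };
    int failed_list[] = { 1, 2, 3, 4 };
    int token_memb[MAX_NODES];
    int entries;

    entries = set_subtract(proc_list, 4, failed_list, 4, token_memb);
    printf("token membership entries after subtraction: %d\n", entries);

    /* corosync asserts that at least one node survives the
     * subtraction; with the lists equal it gets zero and aborts,
     * which matches the backtrace above */
    assert(entries >= 1);
    return 0;
}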

Hmmm... regarding the rings: I have only one ring in my setup, but it
runs over a bridge on top of an 802.3ad bond across a switch stack.
And I can't say I noticed any ring faults in the logs.
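I'll double-check the ring state directly, though. As far as I know
corosync-cfgtool reports it, and since everything runs over a bridge I
also want to rule out IGMP snooping eating the multicast (br0 below is
just a placeholder for my actual bridge device):

# Ring status as corosync sees it; a healthy ring reports no faults.
corosync-cfgtool -s

# On kernels with bridge IGMP snooping support, snooping can starve
# unregistered multicast; 1 means snooping is enabled.
cat /sys/class/net/br0/bridge/multicast_snooping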

I'll set everything up to gather an hb_report (it is not so easy,
because I boot the nodes over PXE with a live image in RAM, mount the
state partitions over NFS4, and send logs only to a loghost) and then
try to collect all the information in one place.
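If it turns out that multicast really does stall over the bond for
stretches of time, the "FAILED TO RECEIVE" threshold you mention should
be tunable in the totem section of corosync.conf. A sketch (50 is the
documented default; 100 is only an example value, not a recommendation):

totem {
        # Token rotations without any multicast received before
        # "FAILED TO RECEIVE" is declared (default: 50).
        fail_recv_const: 100
}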

Best,
Vladislav
