On 12/23/2010 06:14 AM, Dejan Muhamedagic wrote: > Hi, > > On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov wrote: >> 01.12.2010 16:32, Dejan Muhamedagic wrote: >>> Hi, >>> >>> On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov wrote: >>>> Hi Steven, hi all. >>>> >>>> I often see this assert on one of nodes after I stop corosync on some >>>> another node in newly-setup 4-node cluster. >>> >>> Does the assert happen on a node lost event? Or once new >>> partition is formed? >> >> I first noticed it when I rebooted another node, just after console said >> that OpenAIS is stopped. >> >> Can't say right now, what exactly event did it follow, I'm actually >> fighting with several problems with corosync, pacemaker, NFS4 and >> phantom uncorrectable ECC errors simultaneously and I'm a bit lost with >> all of them. >> >>> >>>> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6 >>>> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6 >>>> #2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6 >>>> #3 0x00007f5196176406 in memb_consensus_agreed >>>> (instance=0x7f5196554010) at totemsrp.c:1194 >>>> #4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010, >>>> memb_join=0x262f628) at totemsrp.c:3918 >>>> #5 0x00007f519617b619 in message_handler_memb_join >>>> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value >>>> optimized out>, endian_conversion_needed=<value optimized out>) >>>> at totemsrp.c:4161 >>>> #6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030, >>>> iface_no=0, context=<value optimized out>, msg=<value optimized out>, >>>> msg_len=<value optimized out>) at totemrrp.c:720 >>>> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>, >>>> msg=0x262f628, msg_len=420) at totemrrp.c:1404 >>>> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>, >>>> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80) >>>> at totemudp.c:1244 >>>> #9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at >>>> coropoll.c:510 >>>> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value >>>> optimized out>, envp=<value optimized out>) at main.c:1680 >>>> >>>> Last fplay lines are: >>>> >>>> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to >>>> pending delivery queue >>>> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to >>>> pending delivery queue >>>> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366 >>>> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367 >>>> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366 >>>> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367 >>>> rec=[36130] Log Message=releasing messages up to and including 1367 >>>> rec=[36131] Log Message=FAILED TO RECEIVE >>>> rec=[36132] Log Message=entering GATHER state from 6. >>>> rec=[36133] Log Message=entering GATHER state from 0. >>>> Finishing replay: records found [33993] >>>> >>>> What could be the reason for this? Bug, switches, memory errors? >>> >>> The assertion fails because corosync finds out that >>> instance->my_proc_list and instance->my_failed_list are >>> equal. That happens immediately after the "FAILED TO RECEIVE" >>> message which is issued when fail_recv_const token rotations >>> happened without any multicast packet received (defaults to 50). > > I took a look at the code and the protocol specification again > and it seems like that assert is not valid since Steve patched > the part dealing with the "FAILED TO RECEIVE" condition. The > patch is from 2010-06-03 posted to the list here > http://marc.info/?l=openais&m=127559807608484&w=2 > > The last hunk of the patch contains this code (exec/totemsrp.c): > > 3933 if (memb_consensus_agreed (instance) && instance->failed_to_recv > == 1) { > 3934 instance->failed_to_recv = 0; > 3935 srp_addr_copy (&instance->my_proc_list[0], > 3936 &instance->my_id); > 3937 instance->my_proc_list_entries = 1; > 3938 instance->my_failed_list_entries = 0; > 3939 > 3940 memb_state_commit_token_create (instance); > 3941 > 3942 memb_state_commit_enter (instance); > 3943 return; > 3944 } > > This code never got a chance to run because on failed_to_recv > the two sets (my_process_list and my_failed_list) are equal which > makes the assert fail in memb_consensus_agreed(): > > 1185 memb_set_subtract (token_memb, &token_memb_entries, > 1186 instance->my_proc_list, instance->my_proc_list_entries, > 1187 instance->my_failed_list, instance->my_failed_list_entries); > ... > 1195 assert (token_memb_entries >= 1); > > In other words, it's something like this: > > if A: > if memb_consensus_agreed() and failed_to_recv: > form a single node ring and try to recover > > memb_consensus_agreed(): > assert(!A) > > Steve, can you take a look and confirm that this holds. > > Cheers, >
Dejan, sorry for delay in response - big backlog which is mostly cleared out :) The assert definitely isn't correct, but removing it without addressing the contents of the proc and fail lists is also not right. That would cause the logic in the if statement at line 3933 not to be executed (because the first part of the if would evaluate to false) I believe what we should do is check the "failed_to_recv" value in memb_consensus_agreed instead of at line 3933. The issue with this is memb_state_consensus_timeout_expired which also executes some 'then' logic where we may not want to execute the failed_to_recv logic. If anyone has a reliable reproducer and can forward to me, I'll test out a change to address this problem. Really hesitant to change anything in totemsrp without a test case for this problem - its almost perfect ;-) Regards -steve > Dejan > _______________________________________________ > Openais mailing list > [email protected] > https://lists.linux-foundation.org/mailman/listinfo/openais _______________________________________________ Openais mailing list [email protected] https://lists.linux-foundation.org/mailman/listinfo/openais
