Hi Steven, > Please use the suggested workaround posted earlier today.
Where was that workaround posted? Could you please send me a link? > I have tested the patch I sent to the list previously and it does not fix the > problem. All right. We will shelve testing of this patch. Best Regards, Hideo Yamauchi. --- Steven Dake <[email protected]> wrote: > Hideo, > > Please use the suggested workaround posted earlier today. I have tested > the patch I sent to the list previously and it does not fix the problem. > > Regards > -steve > > On 02/09/2011 05:41 PM, [email protected] wrote: > > Hi Steven, > > > > Thank you for the comment. > > > >>> Is your patch the second of the following? > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token(take 2) > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >> > >> Yes, the take 2 version. > > > > All right. > > Thanks! > > > > Hideo Yamauchi. > > > > --- Steven Dake <[email protected]> wrote: > > > >> On 02/08/2011 05:54 PM, [email protected] wrote: > >>> Hi Steven, > >>> > >>>> Give the patch I have sent to this ML a try. If the issue persists, > >>>> we can look at more options. > >>> > >>> Thank you for the comment. > >>> > >>> Is your patch the second of the following? > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token(take 2) > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >> > >> Yes, the take 2 version. > >> > >> Regards > >> -steve > >> > >> > >>> > >>> > >>> --- Steven Dake <[email protected]> wrote: > >>> > >>>> On 02/07/2011 11:11 PM, [email protected] wrote: > >>>>> Hi Steven, > >>>>> > >>>>> I misunderstood your point. > >>>>> > >>>>> We do not have a simple test case. > >>>>> > >>>>> The phenomenon we see in our environment is the following. > >>>>> > >>>>> Step 1) corosync forms a cluster of 12 nodes. > >>>>> * Token communication begins. > >>>>> > >>>>> Step 2) One node raises [FAILED TO RECEIVE]. > >>>>> > >>>>> Step 3) All 12 nodes begin reconfiguring the cluster. > >>>>> > >>>>> Step 4) The node that raised [FAILED TO RECEIVE] fails to reach > >>>>> consensus in the JOIN > >>>>> communication. > >>>>> * Because the node failed to reach consensus, it makes the contents of > >>>>> the failed list and the proc list the same. > >>>>> * When this node then compares the failed list with the proc list, the > >>>>> assert fails. > >>>>> > >>>>> > >>>>> When the node forms a cluster alone, I think that assert() is > >>>>> unnecessary, > >>>>> > >>>>> because the following processing exists. > >>>>> > >>>>> > >>>> > >>>> Give the patch I have sent to this ML a try. If the issue persists, > >>>> we can look at more options. > >>>> > >>>> Thanks! > >>>> -steve > >>>> > >>>> > >>>>> > >>>>> static void memb_join_process ( > >>>>> struct totemsrp_instance *instance, > >>>>> const struct memb_join *memb_join) > >>>>> { > >>>>> struct srp_addr *proc_list; > >>>>> struct srp_addr *failed_list; > >>>>> (snip) > >>>>> instance->failed_to_recv = 0; > >>>>> srp_addr_copy > >>>>> (&instance->my_proc_list[0], > >>>>> &instance->my_id); > >>>>> instance->my_proc_list_entries = 1; > >>>>> instance->my_failed_list_entries = 0; > >>>>> > >>>>> memb_state_commit_token_create > >>>>> (instance); > >>>>> > >>>>> memb_state_commit_enter (instance); > >>>>> return; > >>>>> > >>>>> (snip) > >>>>> > >>>>> Best Regards, > >>>>> Hideo Yamauchi.
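For illustration, here is a minimal standalone sketch of why the comparison in Step 4 aborts. It is not corosync source; the set_subtract() helper and the integer node IDs are made up, but the final check mirrors the assert in memb_consensus_agreed(): once the failed list equals the proc list, subtracting one set from the other leaves an empty token membership, so token_memb_entries >= 1 cannot hold.

/*
 * Minimal standalone sketch (not corosync source; the helper and node
 * IDs are hypothetical) of the Step 4 failure: with equal proc and
 * failed lists, the set subtraction yields zero members and the
 * assert-style check aborts.
 */
#include <assert.h>
#include <stddef.h>

/* keep the members of "procs" that are not also in "failed" */
static size_t set_subtract(int *out, const int *procs, size_t n_procs,
                           const int *failed, size_t n_failed)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_procs; i++) {
        int in_failed = 0;
        for (size_t j = 0; j < n_failed; j++) {
            if (procs[i] == failed[j]) {
                in_failed = 1;
                break;
            }
        }
        if (!in_failed)
            out[n_out++] = procs[i];
    }
    return n_out;
}

int main(void)
{
    /* after FAILED TO RECEIVE the proc list is copied into the failed
     * list, so both sets are identical */
    int proc_list[]   = { 1, 2, 3 };
    int failed_list[] = { 1, 2, 3 };
    int token_memb[3];

    size_t token_memb_entries =
        set_subtract(token_memb, proc_list, 3, failed_list, 3);

    /* same condition as the assert in memb_consensus_agreed(); with
     * equal sets it fails and the process aborts */
    assert(token_memb_entries >= 1);
    return 0;
}

Compiled and run, this aborts on the assert, which is the same condition that trips in memb_consensus_agreed() in the backtrace quoted further down the thread.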
> >>>>> > >>>>> > >>>>> > >>>>> --- [email protected] wrote: > >>>>> > >>>>>> Hi Steven, > >>>>>> > >>>>>>> Hideo, > >>>>>>> > >>>>>>> If you have a test case, I can make a patch for you to try. > >>>>>>> > >>>>>> > >>>>>> All right. > >>>>>> > >>>>>> We use corosync.1.3.0. > >>>>>> > >>>>>> Please send me patch. > >>>>>> > >>>>>> Best Regards, > >>>>>> Hideo Yamauchi. > >>>>>> > >>>>>> --- Steven Dake <[email protected]> wrote: > >>>>>> > >>>>>>> On 02/06/2011 09:16 PM, [email protected] wrote: > >>>>>>>> Hi Steven, > >>>>>>>> Hi Dejan, > >>>>>>>> > >>>>>>>>>>>> This code never got a chance to run because on failed_to_recv > >>>>>>>>>>>> the two sets (my_process_list and my_failed_list) are equal which > >>>>>>>>>>>> makes the assert fail in memb_consensus_agreed(): > >>>>>>>> > >>>>>>>> The same problem occurs, and we are troubled, too. > >>>>>>>> > >>>>>>>> How did this argument turn out? > >>>>>>>> > >>>>>>>> Best Regards, > >>>>>>>> Hideo Yamauchi. > >>>>>>>> > >>>>>>> > >>>>>>> Hideo, > >>>>>>> > >>>>>>> If you have a test case, I can make a patch for you to try. > >>>>>>> > >>>>>>> Regards > >>>>>>> -steve > >>>>>>> > >>>>>>>> > >>>>>>>> --- Dejan Muhamedagic <[email protected]> wrote: > >>>>>>>> > >>>>>>>>> nudge, nudge > >>>>>>>>> > >>>>>>>>> On Wed, Jan 05, 2011 at 02:05:55PM +0100, Dejan Muhamedagic wrote: > >>>>>>>>>> On Tue, Jan 04, 2011 at 01:53:00PM -0700, Steven Dake wrote: > >>>>>>>>>>> On 12/23/2010 06:14 AM, Dejan Muhamedagic wrote: > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> 01.12.2010 16:32, Dejan Muhamedagic wrote: > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Hi Steven, hi all. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I often see this assert on one of nodes after I stop corosync > >>>>>>>>>>>>>>> on some > >>>>>>>>>>>>>>> another node in newly-setup 4-node cluster. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Does the assert happen on a node lost event? Or once new > >>>>>>>>>>>>>> partition is formed? > >>>>>>>>>>>>> > >>>>>>>>>>>>> I first noticed it when I rebooted another node, just after > >>>>>>>>>>>>> console said > >>>>>>>>>>>>> that OpenAIS is stopped. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Can't say right now, what exactly event did it follow, I'm > >>>>>>>>>>>>> actually > >>>>>>>>>>>>> fighting with several problems with corosync, pacemaker, NFS4 > >>>>>>>>>>>>> and > >>>>>>>>>>>>> phantom uncorrectable ECC errors simultaneously and I'm a bit > >>>>>>>>>>>>> lost with > >>>>>>>>>>>>> all of them. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6 > >>>>>>>>>>>>>>> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6 > >>>>>>>>>>>>>>> #2 0x00007f51953dd935 in __assert_fail () from > >>>>>>>>>>>>>>> /lib64/libc.so.6 > >>>>>>>>>>>>>>> #3 0x00007f5196176406 in memb_consensus_agreed > >>>>>>>>>>>>>>> (instance=0x7f5196554010) at totemsrp.c:1194 > >>>>>>>>>>>>>>> #4 0x00007f519617b2f3 in memb_join_process > >>>>>>>>>>>>>>> (instance=0x7f5196554010, > >>>>>>>>>>>>>>> memb_join=0x262f628) at totemsrp.c:3918 > >>>>>>>>>>>>>>> #5 0x00007f519617b619 in message_handler_memb_join > >>>>>>>>>>>>>>> (instance=0x7f5196554010, msg=<value optimized out>, > >>>>>>>>>>>>>>> msg_len=<value > >>>>>>>>>>>>>>> optimized out>, endian_conversion_needed=<value optimized > >>>>>>>>>>>>>>> out>) > >>>>>>>>>>>>>>> at totemsrp.c:4161 > >>>>>>>>>>>>>>> #6 0x00007f5196173ba7 in passive_mcast_recv > >>>>>>>>>>>>>>> (rrp_instance=0x2603030, > >>>>>>>>>>>>>>> iface_no=0, context=<value optimized out>, msg=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> msg_len=<value optimized out>) at totemrrp.c:720 > >>>>>>>>>>>>>>> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> msg=0x262f628, msg_len=420) at totemrrp.c:1404 > >>>>>>>>>>>>>>> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> fd=<value optimized out>, revents=<value optimized out>, > >>>>>>>>>>>>>>> data=0x262ef80) > >>>>>>>>>>>>>>> at totemudp.c:1244 > >>>>>>>>>>>>>>> #9 0x00007f519616d7f2 in poll_run > >>>>>>>>>>>>>>> (handle=4858364909567606784) at > >>>>>>>>>>>>>>> coropoll.c:510 > >>>>>>>>>>>>>>> #10 0x0000000000406add in main (argc=<value optimized out>, > >>>>>>>>>>>>>>> argv=<value > >>>>>>>>>>>>>>> optimized out>, envp=<value optimized out>) at main.c:1680 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Last fplay lines are: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> rec=[36124] Log Message=Delivering MCAST message with seq > >>>>>>>>>>>>>>> 1366 to > >>>>>>>>>>>>>>> pending delivery queue > >>>>>>>>>>>>>>> rec=[36125] Log Message=Delivering MCAST message with seq > >>>>>>>>>>>>>>> 1367 to > >>>>>>>>>>>>>>> pending delivery queue > >>>>>>>>>>>>>>> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1366 > >>>>>>>>>>>>>>> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1367 > >>>>>>>>>>>>>>> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1366 > >>>>>>>>>>>>>>> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1367 > >>>>>>>>>>>>>>> rec=[36130] Log Message=releasing messages up to and > >>>>>>>>>>>>>>> including 1367 > >>>>>>>>>>>>>>> rec=[36131] Log Message=FAILED TO RECEIVE > >>>>>>>>>>>>>>> rec=[36132] Log Message=entering GATHER state from 6. > >>>>>>>>>>>>>>> rec=[36133] Log Message=entering GATHER state from 0. > >>>>>>>>>>>>>>> Finishing replay: records found [33993] > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What could be the reason for this? Bug, switches, memory > >>>>>>>>>>>>>>> errors? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The assertion fails because corosync finds out that > >>>>>>>>>>>>>> instance->my_proc_list and instance->my_failed_list are > >>>>>>>>>>>>>> equal. 
That happens immediately after the "FAILED TO RECEIVE" > >>>>>>>>>>>>>> message which is issued when fail_recv_const token rotations > >>>>>>>>>>>>>> have happened without any multicast packet being received (defaults to > >>>>>>>>>>>>>> 50). > >>>>>>>>>>>> > >>>>>>>>>>>> I took a look at the code and the protocol specification again > >>>>>>>>>>>> and it seems like that assert is not valid since Steve patched > >>>>>>>>>>>> the part dealing with the "FAILED TO RECEIVE" condition. The > >>>>>>>>>>>> patch is from 2010-06-03, posted to the list here: > >>>>>>>>>>>> http://marc.info/?l=openais&m=127559807608484&w=2 > >>>>>>>>>>>> > >>>>>>>>>>>> The last hunk of the patch contains this code (exec/totemsrp.c): > >>>>>>>>>>>> > >>>>>>>>>>>> 3933 if (memb_consensus_agreed (instance) && > >>>>>>>>>>>> instance->failed_to_recv == 1) { > >>>>>>>>>>>> 3934 instance->failed_to_recv = 0; > >>>>>>>>>>>> 3935 srp_addr_copy (&instance->my_proc_list[0], > >>>>>>>>>>>> 3936 &instance->my_id); > >>>>>>>>>>>> 3937 instance->my_proc_list_entries = 1; > >>>>>>>>>>>> 3938 instance->my_failed_list_entries = 0; > >>>>>>>>>>>> 3939 > >>>>>>>>>>>> 3940 memb_state_commit_token_create (instance); > >>>>>>>>>>>> 3941 > >>>>>>>>>>>> 3942 memb_state_commit_enter (instance); > >>>>>>>>>>>> 3943 return; > >>>>>>>>>>>> 3944 } > >>>>>>>>>>>> > >>>>>>>>>>>> This code never got a chance to run because on failed_to_recv > >>>>>>>>>>>> the two sets (my_process_list and my_failed_list) are equal, which > >>>>>>>>>>>> makes the assert fail in memb_consensus_agreed(): > >>>>>>>>>>>> > >>>>>>>>>>>> 1185 memb_set_subtract (token_memb, &token_memb_entries, > >>>>>>>>>>>> 1186 instance->my_proc_list, > >>>>>>>>>>>> instance->my_proc_list_entries, > >>>>>>>>>>>> 1187 instance->my_failed_list, > >>>>>>>>>>>> instance->my_failed_list_entries); > >>>>>>>>>>>> ... > >>>>>>>>>>>> 1195 assert (token_memb_entries >= 1); > >>>>>>>>>>>> > >>>>>>>>>>>> In other words, it's something like this: > >>>>>>>>>>>> > >>>>>>>>>>>> if A: > >>>>>>>>>>>> if memb_consensus_agreed() and failed_to_recv: > >>>>>>>>>>>> form a single node ring and try to recover > >>>>>>>>>>>> > >>>>>>>>>>>> memb_consensus_agreed(): > >>>>>>>>>>>> assert(!A) > >>>>>>>>>>>> > >>>>>>>>>>>> Steve, can you take a look and confirm that this holds? > >>>>>>>>>>>> > >>>>>>>>>>>> Cheers, > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Dejan, > >>>>>>>>>>> > >>>>>>>>>>> sorry for the delay in response - big backlog, which is mostly cleared > >>>>>>>>>>> out :) > >>>>>>>>>> > >>>>>>>>>> No problem. > >>>>>>>>>> > >>>>>>>>>>> The assert definitely isn't correct, but removing it without > >>>>>>>>>>> addressing > >>>>>>>>>>> the contents of the proc and fail lists is also not right. That > >>>>>>>>>>> would > >>>>>>>>>>> cause the logic in the if statement at line 3933 not to be > >>>>>>>>>>> executed > >>>>>>>>>>> (because the first part of the if would evaluate to false). > >>>>>>>>>> > >>>>>>>>>> Actually it wouldn't. The agreed variable is set to 1 and it > >>>>>>>>>> is going to be returned unchanged. > >>>>>>>>>> > >>>>>>>>>>> I believe > >>>>>>>>>>> what we should do is check the "failed_to_recv" value in > >>>>>>>>>>> memb_consensus_agreed instead of at line 3933. > >>>>>>>>>>> > >>>>>>>>>>> The issue with this is memb_state_consensus_timeout_expired, which > >>>>>>>>>>> also > >>>>>>>>>>> executes some 'then' logic where we may not want to execute the > >>>>>>>>>>> failed_to_recv logic.
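To make that direction concrete, below is a rough standalone model of checking failed_to_recv inside the consensus test so the assert is bypassed and the single-node recovery branch can run. All names and the simplified set handling are hypothetical; this is only a sketch of the idea under discussion, not the posted patch, and it deliberately ignores the memb_state_consensus_timeout_expired caller Steve mentions.

/*
 * Hypothetical standalone model (not corosync source) of checking
 * failed_to_recv inside the consensus test instead of at line 3933,
 * so the empty membership is tolerated and recovery can proceed.
 */
#include <assert.h>
#include <stdio.h>

struct model_instance {
    int failed_to_recv;    /* set after FAILED TO RECEIVE           */
    int proc_entries;      /* stands in for my_proc_list_entries    */
    int failed_entries;    /* stands in for my_failed_list_entries  */
};

/* simplified stand-in for memb_consensus_agreed() */
static int consensus_agreed(const struct model_instance *inst)
{
    /* with equal proc and failed lists the subtraction is empty */
    int token_memb_entries = inst->proc_entries - inst->failed_entries;

    if (inst->failed_to_recv == 1) {
        /* an empty result is expected here, so report agreement
         * instead of asserting */
        return 1;
    }

    assert(token_memb_entries >= 1);
    return 1;
}

/* simplified stand-in for the recovery branch in memb_join_process() */
static void join_process(struct model_instance *inst)
{
    if (consensus_agreed(inst) && inst->failed_to_recv == 1) {
        inst->failed_to_recv = 0;
        inst->proc_entries = 1;      /* proc list reduced to my_id */
        inst->failed_entries = 0;    /* failed list cleared        */
        printf("form single-node ring, enter commit\n");
    }
}

int main(void)
{
    /* the 12-node scenario: proc and failed lists of equal size */
    struct model_instance inst = {
        .failed_to_recv = 1, .proc_entries = 12, .failed_entries = 12
    };
    join_process(&inst);   /* no abort; the recovery branch runs */
    return 0;
}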
> >>>>>>>>>> > >>>>>>>>>> Perhaps we should just > >>>>>>>>>> > >>>>>>>>>> 3933 if (instance->failed_to_recv == 1) { > >>>>>>>>>> > >>>>>>>>>> ? In the failed_to_recv case both the proc and fail lists are equal, so > >>>>>>>>>> checking memb_consensus_agreed won't make sense, right? > >>>>>>>>>> > >>>>>>>>>>> If anyone has a reliable reproducer and can forward it to me, I'll > >>>>>>>>>>> test out > >>>>>>>>>>> a change to address this problem. Really hesitant to change > >>>>>>>>>>> anything in > >>>>>>>>>>> totemsrp without a test case for this problem - it's almost > >>>>>>>>>>> perfect ;-) > >>>>>>>>>> > >>>>>>>>>> Since the tester upgraded the switch firmware, they couldn't > >>>>>>>>>> reproduce it anymore. > >>>>>>>>>> > >>>>>>>>>> Would compiling with these help? > >>>>>>>>>> > >>>>>>>>>> /* > >>>>>>>>>> * These can be used to test the error recovery algorithms > >>>>>>>>>> * #define TEST_DROP_ORF_TOKEN_PERCENTAGE 30 > >>>>>>>>>> * #define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30 > >>>>>>>>>> * #define TEST_DROP_MCAST_PERCENTAGE 50 > >>>>>>>>>> * #define TEST_RECOVERY_MSG_COUNT 300 > >>>>>>>>>> */ > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> > >>>>>>>>>> Dejan > >>>>>>>>>> > >>>>>>>>>>> Regards > >>>>>>>>>>> -steve > >>>>>>>>>>> > >>>>>>>>>>>> Dejan _______________________________________________ Openais mailing list [email protected] https://lists.linux-foundation.org/mailman/listinfo/openais
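In case it helps with reproducing, one way to try those knobs is sketched below: uncomment (or add) the defines near the top of exec/totemsrp.c before rebuilding, using the values suggested in that source comment. The per-define descriptions are only name-based guesses, and whether these rates actually reproduce the FAILED TO RECEIVE assert is untested.

/*
 * Sketch only: enable the compile-time fault-injection knobs quoted
 * above, e.g. by uncommenting them in exec/totemsrp.c and rebuilding.
 * Values are the ones suggested in the source comment; the inline
 * descriptions are guesses based on the macro names.
 */
#define TEST_DROP_ORF_TOKEN_PERCENTAGE    30   /* drop ~30% of ORF tokens      */
#define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30   /* drop ~30% of commit tokens   */
#define TEST_DROP_MCAST_PERCENTAGE        50   /* drop ~50% of mcast packets   */
#define TEST_RECOVERY_MSG_COUNT           300  /* messages handled in recovery */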
