Hideo,

Please use the suggested workaround posted earlier today. I have tested the
patch I sent to the list previously and it does not fix the problem.
Regards
-steve

On 02/09/2011 05:41 PM, [email protected] wrote:
> Hi Steven,
>
> Thank you for the comment.
>
>>> Is your patch the second of the following two?
>>>
>>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding
>>> the token
>>>
>>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding
>>> the token(take 2)
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>
>> Yes, the take 2 version.
>
> All right.
> Thanks!
>
> Hideo Yamauchi.
>
> --- Steven Dake <[email protected]> wrote:
>
>> On 02/08/2011 05:54 PM, [email protected] wrote:
>>> Hi Steven,
>>>
>>>> Have a try of the patch I have sent to this list. If the issue persists,
>>>> we can look at more options.
>>>
>>> Thank you for the comment.
>>>
>>> Is your patch the second of the following two?
>>>
>>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding
>>> the token
>>>
>>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding
>>> the token(take 2)
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>
>> Yes, the take 2 version.
>>
>> Regards
>> -steve
>>
>>
>>>
>>>
>>> --- Steven Dake <[email protected]> wrote:
>>>
>>>> On 02/07/2011 11:11 PM, [email protected] wrote:
>>>>> Hi Steven,
>>>>>
>>>>> I misunderstood your opinion earlier.
>>>>>
>>>>> We do not have a simple test case.
>>>>>
>>>>> The phenomenon in our environment is the following:
>>>>>
>>>>> Step 1) corosync forms a cluster of 12 nodes.
>>>>>  * Token communication begins.
>>>>>
>>>>> Step 2) One node raises [FAILED TO RECEIVE].
>>>>>
>>>>> Step 3) All 12 nodes begin the reconfiguration of the cluster.
>>>>>
>>>>> Step 4) The node that raised [FAILED TO RECEIVE] fails in the consensus
>>>>> of the JOIN communication.
>>>>>  * Because the node failed in the consensus, it makes the contents of
>>>>>    failed_list and proc_list the same.
>>>>>  * This node then compares failed_list with proc_list, and the assert
>>>>>    fails.
>>>>>
>>>>> When such a node ends up forming a cluster alone, I think that the
>>>>> assert() is unnecessary, because the following processing already
>>>>> handles that case.
>>>>>
>>>>
>>>> Have a try of the patch I have sent to this list. If the issue persists,
>>>> we can look at more options.
>>>>
>>>> Thanks!
>>>> -steve
>>>>
>>>>
>>>>>
>>>>> static void memb_join_process (
>>>>>     struct totemsrp_instance *instance,
>>>>>     const struct memb_join *memb_join)
>>>>> {
>>>>>     struct srp_addr *proc_list;
>>>>>     struct srp_addr *failed_list;
>>>>> (snip)
>>>>>     instance->failed_to_recv = 0;
>>>>>     srp_addr_copy (&instance->my_proc_list[0],
>>>>>         &instance->my_id);
>>>>>     instance->my_proc_list_entries = 1;
>>>>>     instance->my_failed_list_entries = 0;
>>>>>
>>>>>     memb_state_commit_token_create (instance);
>>>>>
>>>>>     memb_state_commit_enter (instance);
>>>>>     return;
>>>>>
>>>>> (snip)
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>>
>>>>> --- [email protected] wrote:
>>>>>
>>>>>> Hi Steven,
>>>>>>
>>>>>>> Hideo,
>>>>>>>
>>>>>>> If you have a test case, I can make a patch for you to try.
>>>>>>>
>>>>>>
>>>>>> All right.
>>>>>>
>>>>>> We use corosync 1.3.0.
>>>>>>
>>>>>> Please send me the patch.
>>>>>>
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
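A minimal standalone sketch of the failure mode Hideo describes above,
assuming node IDs reduced to plain ints and a paraphrased memb_set_subtract
(this is not the actual totemsrp code): once failed_list equals proc_list,
the set subtraction leaves zero members, so the assert in
memb_consensus_agreed() fires before the single-node recovery path in
memb_join_process can run.

/* sketch.c - illustrates the empty-set condition behind the assert */
#include <assert.h>
#include <stdio.h>

#define MAX_NODES 16

/* Paraphrase of memb_set_subtract: result = a minus b */
static int set_subtract (int *result, const int *a, int a_len,
	const int *b, int b_len)
{
	int r = 0;

	for (int i = 0; i < a_len; i++) {
		int in_b = 0;
		for (int j = 0; j < b_len; j++) {
			if (a[i] == b[j]) {
				in_b = 1;
				break;
			}
		}
		if (in_b == 0) {
			result[r++] = a[i];
		}
	}
	return (r);
}

int main (void)
{
	/*
	 * After FAILED TO RECEIVE, the failed node copies its proc list
	 * into its failed list, so the two sets are equal.
	 */
	int proc_list[MAX_NODES]   = { 1, 2, 3 };
	int failed_list[MAX_NODES] = { 1, 2, 3 };
	int token_memb[MAX_NODES];
	int token_memb_entries;

	token_memb_entries = set_subtract (token_memb,
		proc_list, 3, failed_list, 3);

	printf ("token_memb_entries = %d\n", token_memb_entries); /* prints 0 */

	/* Equivalent of the assert at totemsrp.c:1194 - aborts here */
	assert (token_memb_entries >= 1);
	return (0);
}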
>>>>>>
>>>>>> --- Steven Dake <[email protected]> wrote:
>>>>>>
>>>>>>> On 02/06/2011 09:16 PM, [email protected] wrote:
>>>>>>>> Hi Steven,
>>>>>>>> Hi Dejan,
>>>>>>>>
>>>>>>>>>>>> This code never got a chance to run because on failed_to_recv
>>>>>>>>>>>> the two sets (my_proc_list and my_failed_list) are equal, which
>>>>>>>>>>>> makes the assert fail in memb_consensus_agreed():
>>>>>>>>
>>>>>>>> The same problem occurs for us, and we are troubled by it, too.
>>>>>>>>
>>>>>>>> How did this discussion turn out?
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
>>>>>>>>
>>>>>>>
>>>>>>> Hideo,
>>>>>>>
>>>>>>> If you have a test case, I can make a patch for you to try.
>>>>>>>
>>>>>>> Regards
>>>>>>> -steve
>>>>>>>
>>>>>>>>
>>>>>>>> --- Dejan Muhamedagic <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> nudge, nudge
>>>>>>>>>
>>>>>>>>> On Wed, Jan 05, 2011 at 02:05:55PM +0100, Dejan Muhamedagic wrote:
>>>>>>>>>> On Tue, Jan 04, 2011 at 01:53:00PM -0700, Steven Dake wrote:
>>>>>>>>>>> On 12/23/2010 06:14 AM, Dejan Muhamedagic wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov wrote:
>>>>>>>>>>>>> 01.12.2010 16:32, Dejan Muhamedagic wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Hi Steven, hi all.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I often see this assert on one of the nodes after I stop
>>>>>>>>>>>>>>> corosync on some other node in a newly-set-up 4-node cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does the assert happen on a node-lost event? Or once the new
>>>>>>>>>>>>>> partition is formed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I first noticed it when I rebooted another node, just after the
>>>>>>>>>>>>> console said that OpenAIS was stopped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can't say right now exactly which event it followed. I'm
>>>>>>>>>>>>> actually fighting several problems with corosync, pacemaker,
>>>>>>>>>>>>> NFS4 and phantom uncorrectable ECC errors simultaneously, and
>>>>>>>>>>>>> I'm a bit lost with all of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #0  0x00007f51953e49a5 in raise () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #1  0x00007f51953e6185 in abort () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #2  0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #3  0x00007f5196176406 in memb_consensus_agreed
>>>>>>>>>>>>>>> (instance=0x7f5196554010) at totemsrp.c:1194
>>>>>>>>>>>>>>> #4  0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
>>>>>>>>>>>>>>> memb_join=0x262f628) at totemsrp.c:3918
>>>>>>>>>>>>>>> #5  0x00007f519617b619 in message_handler_memb_join
>>>>>>>>>>>>>>> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
>>>>>>>>>>>>>>> optimized out>, endian_conversion_needed=<value optimized out>)
>>>>>>>>>>>>>>> at totemsrp.c:4161
>>>>>>>>>>>>>>> #6  0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
>>>>>>>>>>>>>>> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
>>>>>>>>>>>>>>> msg_len=<value optimized out>) at totemrrp.c:720
>>>>>>>>>>>>>>> #7  0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
>>>>>>>>>>>>>>> msg=0x262f628, msg_len=420) at totemrrp.c:1404
>>>>>>>>>>>>>>> #8  0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
>>>>>>>>>>>>>>> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
>>>>>>>>>>>>>>> at totemudp.c:1244
>>>>>>>>>>>>>>> #9  0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
>>>>>>>>>>>>>>> coropoll.c:510
>>>>>>>>>>>>>>> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
>>>>>>>>>>>>>>> optimized out>, envp=<value optimized out>) at main.c:1680
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Last fplay lines are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
>>>>>>>>>>>>>>> pending delivery queue
>>>>>>>>>>>>>>> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
>>>>>>>>>>>>>>> pending delivery queue
>>>>>>>>>>>>>>> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
>>>>>>>>>>>>>>> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
>>>>>>>>>>>>>>> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
>>>>>>>>>>>>>>> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
>>>>>>>>>>>>>>> rec=[36130] Log Message=releasing messages up to and including 1367
>>>>>>>>>>>>>>> rec=[36131] Log Message=FAILED TO RECEIVE
>>>>>>>>>>>>>>> rec=[36132] Log Message=entering GATHER state from 6.
>>>>>>>>>>>>>>> rec=[36133] Log Message=entering GATHER state from 0.
>>>>>>>>>>>>>>> Finishing replay: records found [33993]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What could be the reason for this? A bug, the switches, memory errors?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The assertion fails because corosync finds that
>>>>>>>>>>>>>> instance->my_proc_list and instance->my_failed_list are
>>>>>>>>>>>>>> equal. That happens immediately after the "FAILED TO RECEIVE"
>>>>>>>>>>>>>> message, which is issued when fail_recv_const token rotations
>>>>>>>>>>>>>> have happened without any multicast packet received (defaults to 50).
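A sketch of the counting behaviour just described, with hypothetical field
and function names (the real check lives in the token handling of
exec/totemsrp.c): the count grows on each token rotation that delivers no
needed multicast, and reaching fail_recv_const (default 50) raises the
"FAILED TO RECEIVE" condition.

/* fail_recv_sketch.c - hypothetical model of the fail_recv_const counter */
#include <stdio.h>

#define FAIL_RECV_CONST 50	/* totem.fail_recv_const default */

struct node_state {
	int mcasts_missing;	/* multicasts others hold that we still need */
	int fail_recv_count;	/* token rotations without receive progress */
	int failed_to_recv;
};

static void on_token_rotation (struct node_state *n, int mcast_received)
{
	if (n->mcasts_missing > 0 && mcast_received == 0) {
		n->fail_recv_count += 1;
		if (n->fail_recv_count >= FAIL_RECV_CONST) {
			n->failed_to_recv = 1;	/* "FAILED TO RECEIVE" logged */
		}
	} else {
		n->fail_recv_count = 0;		/* any progress resets the count */
	}
}

int main (void)
{
	struct node_state n = { 2, 0, 0 };
	int rotation;

	/* 50 token rotations with no multicast delivered */
	for (rotation = 0; rotation < FAIL_RECV_CONST; rotation++) {
		on_token_rotation (&n, 0);
	}

	printf ("failed_to_recv = %d\n", n.failed_to_recv);	/* prints 1 */
	return (0);
}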
>>>>>>>>>>>>
>>>>>>>>>>>> I took a look at the code and the protocol specification again,
>>>>>>>>>>>> and it seems that the assert is no longer valid since Steve
>>>>>>>>>>>> patched the part dealing with the "FAILED TO RECEIVE" condition.
>>>>>>>>>>>> The patch is from 2010-06-03, posted to the list here:
>>>>>>>>>>>> http://marc.info/?l=openais&m=127559807608484&w=2
>>>>>>>>>>>>
>>>>>>>>>>>> The last hunk of the patch contains this code (exec/totemsrp.c):
>>>>>>>>>>>>
>>>>>>>>>>>> 3933     if (memb_consensus_agreed (instance) &&
>>>>>>>>>>>>              instance->failed_to_recv == 1) {
>>>>>>>>>>>> 3934         instance->failed_to_recv = 0;
>>>>>>>>>>>> 3935         srp_addr_copy (&instance->my_proc_list[0],
>>>>>>>>>>>> 3936             &instance->my_id);
>>>>>>>>>>>> 3937         instance->my_proc_list_entries = 1;
>>>>>>>>>>>> 3938         instance->my_failed_list_entries = 0;
>>>>>>>>>>>> 3939
>>>>>>>>>>>> 3940         memb_state_commit_token_create (instance);
>>>>>>>>>>>> 3941
>>>>>>>>>>>> 3942         memb_state_commit_enter (instance);
>>>>>>>>>>>> 3943         return;
>>>>>>>>>>>> 3944     }
>>>>>>>>>>>>
>>>>>>>>>>>> This code never gets a chance to run, because on failed_to_recv
>>>>>>>>>>>> the two sets (my_proc_list and my_failed_list) are equal, which
>>>>>>>>>>>> makes the assert fail in memb_consensus_agreed():
>>>>>>>>>>>>
>>>>>>>>>>>> 1185     memb_set_subtract (token_memb, &token_memb_entries,
>>>>>>>>>>>> 1186         instance->my_proc_list, instance->my_proc_list_entries,
>>>>>>>>>>>> 1187         instance->my_failed_list, instance->my_failed_list_entries);
>>>>>>>>>>>> ...
>>>>>>>>>>>> 1195     assert (token_memb_entries >= 1);
>>>>>>>>>>>>
>>>>>>>>>>>> In other words, it's something like this:
>>>>>>>>>>>>
>>>>>>>>>>>> if A:
>>>>>>>>>>>>     if memb_consensus_agreed() and failed_to_recv:
>>>>>>>>>>>>         form a single node ring and try to recover
>>>>>>>>>>>>
>>>>>>>>>>>> memb_consensus_agreed():
>>>>>>>>>>>>     assert(!A)
>>>>>>>>>>>>
>>>>>>>>>>>> Steve, can you take a look and confirm that this holds?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Dejan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the delay in response - big backlog, which is mostly
>>>>>>>>>>> cleared out :)
>>>>>>>>>>
>>>>>>>>>> No problem.
>>>>>>>>>>
>>>>>>>>>>> The assert definitely isn't correct, but removing it without
>>>>>>>>>>> addressing the contents of the proc and fail lists is also not
>>>>>>>>>>> right. That would cause the logic in the if statement at line 3933
>>>>>>>>>>> not to be executed (because the first part of the if would
>>>>>>>>>>> evaluate to false).
>>>>>>>>>>
>>>>>>>>>> Actually it wouldn't. The agreed variable is set to 1 and it
>>>>>>>>>> is going to be returned unchanged.
>>>>>>>>>>
>>>>>>>>>>> I believe what we should do is check the "failed_to_recv" value in
>>>>>>>>>>> memb_consensus_agreed instead of at line 3933.
>>>>>>>>>>>
>>>>>>>>>>> The issue with this is memb_state_consensus_timeout_expired, which
>>>>>>>>>>> also executes some 'then' logic where we may not want to execute
>>>>>>>>>>> the failed_to_recv logic.
>>>>>>>>>>
>>>>>>>>>> Perhaps we should just use
>>>>>>>>>>
>>>>>>>>>> 3933     if (instance->failed_to_recv == 1) {
>>>>>>>>>>
>>>>>>>>>> ? In the failed_to_recv case both proc and fail lists are equal, so
>>>>>>>>>> checking memb_consensus_agreed wouldn't make sense, right?
>>>>>>>>>>
>>>>>>>>>>> If anyone has a reliable reproducer and can forward it to me, I'll
>>>>>>>>>>> test out a change to address this problem.
>>>>>>>>>>> Really hesitant to change anything in totemsrp without a test
>>>>>>>>>>> case for this problem - it's almost perfect ;-)
>>>>>>>>>>
>>>>>>>>>> Since the tester upgraded the switch firmware, they couldn't
>>>>>>>>>> reproduce it anymore.
>>>>>>>>>>
>>>>>>>>>> Would compiling with these help?
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>>  * These can be used to test the error recovery algorithms
>>>>>>>>>>  * #define TEST_DROP_ORF_TOKEN_PERCENTAGE 30
>>>>>>>>>>  * #define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30
>>>>>>>>>>  * #define TEST_DROP_MCAST_PERCENTAGE 50
>>>>>>>>>>  * #define TEST_RECOVERY_MSG_COUNT 300
>>>>>>>>>>  */
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Dejan
>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> -steve
>>>>>>>>>>>
>>>>>>>>>>>> Dejan
>>>>>>>>>>>

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
