Hi Steven, > Please use the suggested workaround posted earlier today.
Where was that workaround posted? Could you please send me a link? > I have tested the patch I sent to the list previously and it does not fix the > problem. All right. We will shelve testing of this patch. Best Regards, Hideo Yamauchi. --- Steven Dake <[email protected]> wrote: > Hideo, > > Please use the suggested workaround posted earlier today. I have tested > the patch I sent to the list previously and it does not fix the problem. > > Regards > -steve > > On 02/09/2011 05:41 PM, [email protected] wrote: > > Hi Steven, > > > > Thank you for the comment. > > > >>> Is your patch the second of the following? > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token(take 2) > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >> > >> Yes, the take 2 version. > > > > All right. > > Thanks! > > > > Hideo Yamauchi. > > > > --- Steven Dake <[email protected]> wrote: > > > >> On 02/08/2011 05:54 PM, [email protected] wrote: > >>> Hi Steven, > >>> > >>>> Give the patch I have sent to this ML a try. If the issue persists, > >>>> we can look at more options. > >>> > >>> Thank you for the comment. > >>> > >>> Is your patch the second of the following? > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token > >>> > >>> * [Openais] [PATCH] When a failed to recv state happens,stop forwarding > >>> the token(take 2) > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >> > >> Yes, the take 2 version. > >> > >> Regards > >> -steve > >> > >> > >>> > >>> > >>> --- Steven Dake <[email protected]> wrote: > >>> > >>>> On 02/07/2011 11:11 PM, [email protected] wrote: > >>>>> Hi Steven, > >>>>> > >>>>> I misunderstood your point. > >>>>> > >>>>> We do not have a simple test case. > >>>>> > >>>>> The phenomenon we see in our environment is the following. > >>>>> > >>>>> Step 1) corosync forms a cluster of 12 nodes. > >>>>> * Token communication begins. > >>>>> > >>>>> Step 2) One node raises [FAILED TO RECEIVE]. > >>>>> > >>>>> Step 3) All 12 nodes begin reconfiguring the cluster. > >>>>> > >>>>> Step 4) The node that raised [FAILED TO RECEIVE] fails to reach > >>>>> consensus in the JOIN > >>>>> communication. > >>>>> * Because the node failed to reach consensus, it makes the contents of > >>>>> the failed list and the proc list the same. > >>>>> * When this node then compares the failed list with the proc list, the > >>>>> assert fails. > >>>>> > >>>>> > >>>>> When the node forms a cluster alone, I think that assert() is > >>>>> unnecessary, > >>>>> > >>>>> because the following processing exists. > >>>>> > >>>>> > >>>> > >>>> Give the patch I have sent to this ML a try. If the issue persists, > >>>> we can look at more options. > >>>> > >>>> Thanks! > >>>> -steve > >>>> > >>>> > >>>>> > >>>>> static void memb_join_process ( > >>>>> struct totemsrp_instance *instance, > >>>>> const struct memb_join *memb_join) > >>>>> { > >>>>> struct srp_addr *proc_list; > >>>>> struct srp_addr *failed_list; > >>>>> (snip) > >>>>> instance->failed_to_recv = 0; > >>>>> srp_addr_copy > >>>>> (&instance->my_proc_list[0], > >>>>> &instance->my_id); > >>>>> instance->my_proc_list_entries = 1; > >>>>> instance->my_failed_list_entries = 0; > >>>>> > >>>>> memb_state_commit_token_create > >>>>> (instance); > >>>>> > >>>>> memb_state_commit_enter (instance); > >>>>> return; > >>>>> > >>>>> (snip) > >>>>> > >>>>> Best Regards, > >>>>> Hideo Yamauchi.
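For illustration, here is a minimal standalone sketch of why the comparison in Step 4 aborts. It is not corosync source; the set_subtract() helper and the integer node IDs are made up, but the final check mirrors the assert in memb_consensus_agreed(): once the failed list equals the proc list, subtracting one set from the other leaves an empty token membership, so token_memb_entries >= 1 cannot hold.

/*
 * Minimal standalone sketch (not corosync source; the helper and node
 * IDs are hypothetical) of the Step 4 failure: with equal proc and
 * failed lists, the set subtraction yields zero members and the
 * assert-style check aborts.
 */
#include <assert.h>
#include <stddef.h>

/* keep the members of "procs" that are not also in "failed" */
static size_t set_subtract(int *out, const int *procs, size_t n_procs,
                           const int *failed, size_t n_failed)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_procs; i++) {
        int in_failed = 0;
        for (size_t j = 0; j < n_failed; j++) {
            if (procs[i] == failed[j]) {
                in_failed = 1;
                break;
            }
        }
        if (!in_failed)
            out[n_out++] = procs[i];
    }
    return n_out;
}

int main(void)
{
    /* after FAILED TO RECEIVE the proc list is copied into the failed
     * list, so both sets are identical */
    int proc_list[]   = { 1, 2, 3 };
    int failed_list[] = { 1, 2, 3 };
    int token_memb[3];

    size_t token_memb_entries =
        set_subtract(token_memb, proc_list, 3, failed_list, 3);

    /* same condition as the assert in memb_consensus_agreed(); with
     * equal sets it fails and the process aborts */
    assert(token_memb_entries >= 1);
    return 0;
}

Compiled and run, this aborts on the assert, which is the same condition that trips in memb_consensus_agreed() in the backtrace quoted further down the thread.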
> >>>>> > >>>>> > >>>>> > >>>>> --- [email protected] wrote: > >>>>> > >>>>>> Hi Steven, > >>>>>> > >>>>>>> Hideo, > >>>>>>> > >>>>>>> If you have a test case, I can make a patch for you to try. > >>>>>>> > >>>>>> > >>>>>> All right. > >>>>>> > >>>>>> We use corosync.1.3.0. > >>>>>> > >>>>>> Please send me patch. > >>>>>> > >>>>>> Best Regards, > >>>>>> Hideo Yamauchi. > >>>>>> > >>>>>> --- Steven Dake <[email protected]> wrote: > >>>>>> > >>>>>>> On 02/06/2011 09:16 PM, [email protected] wrote: > >>>>>>>> Hi Steven, > >>>>>>>> Hi Dejan, > >>>>>>>> > >>>>>>>>>>>> This code never got a chance to run because on failed_to_recv > >>>>>>>>>>>> the two sets (my_process_list and my_failed_list) are equal which > >>>>>>>>>>>> makes the assert fail in memb_consensus_agreed(): > >>>>>>>> > >>>>>>>> The same problem occurs, and we are troubled, too. > >>>>>>>> > >>>>>>>> How did this argument turn out? > >>>>>>>> > >>>>>>>> Best Regards, > >>>>>>>> Hideo Yamauchi. > >>>>>>>> > >>>>>>> > >>>>>>> Hideo, > >>>>>>> > >>>>>>> If you have a test case, I can make a patch for you to try. > >>>>>>> > >>>>>>> Regards > >>>>>>> -steve > >>>>>>> > >>>>>>>> > >>>>>>>> --- Dejan Muhamedagic <[email protected]> wrote: > >>>>>>>> > >>>>>>>>> nudge, nudge > >>>>>>>>> > >>>>>>>>> On Wed, Jan 05, 2011 at 02:05:55PM +0100, Dejan Muhamedagic wrote: > >>>>>>>>>> On Tue, Jan 04, 2011 at 01:53:00PM -0700, Steven Dake wrote: > >>>>>>>>>>> On 12/23/2010 06:14 AM, Dejan Muhamedagic wrote: > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Dec 01, 2010 at 05:30:44PM +0200, Vladislav Bogdanov > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> 01.12.2010 16:32, Dejan Muhamedagic wrote: > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, Nov 23, 2010 at 12:53:42PM +0200, Vladislav Bogdanov > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Hi Steven, hi all. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I often see this assert on one of nodes after I stop corosync > >>>>>>>>>>>>>>> on some > >>>>>>>>>>>>>>> another node in newly-setup 4-node cluster. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Does the assert happen on a node lost event? Or once new > >>>>>>>>>>>>>> partition is formed? > >>>>>>>>>>>>> > >>>>>>>>>>>>> I first noticed it when I rebooted another node, just after > >>>>>>>>>>>>> console said > >>>>>>>>>>>>> that OpenAIS is stopped. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Can't say right now, what exactly event did it follow, I'm > >>>>>>>>>>>>> actually > >>>>>>>>>>>>> fighting with several problems with corosync, pacemaker, NFS4 > >>>>>>>>>>>>> and > >>>>>>>>>>>>> phantom uncorrectable ECC errors simultaneously and I'm a bit > >>>>>>>>>>>>> lost with > >>>>>>>>>>>>> all of them. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6 > >>>>>>>>>>>>>>> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6 > >>>>>>>>>>>>>>> #2 0x00007f51953dd935 in __assert_fail () from > >>>>>>>>>>>>>>> /lib64/libc.so.6 > >>>>>>>>>>>>>>> #3 0x00007f5196176406 in memb_consensus_agreed > >>>>>>>>>>>>>>> (instance=0x7f5196554010) at totemsrp.c:1194 > >>>>>>>>>>>>>>> #4 0x00007f519617b2f3 in memb_join_process > >>>>>>>>>>>>>>> (instance=0x7f5196554010, > >>>>>>>>>>>>>>> memb_join=0x262f628) at totemsrp.c:3918 > >>>>>>>>>>>>>>> #5 0x00007f519617b619 in message_handler_memb_join > >>>>>>>>>>>>>>> (instance=0x7f5196554010, msg=<value optimized out>, > >>>>>>>>>>>>>>> msg_len=<value > >>>>>>>>>>>>>>> optimized out>, endian_conversion_needed=<value optimized > >>>>>>>>>>>>>>> out>) > >>>>>>>>>>>>>>> at totemsrp.c:4161 > >>>>>>>>>>>>>>> #6 0x00007f5196173ba7 in passive_mcast_recv > >>>>>>>>>>>>>>> (rrp_instance=0x2603030, > >>>>>>>>>>>>>>> iface_no=0, context=<value optimized out>, msg=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> msg_len=<value optimized out>) at totemrrp.c:720 > >>>>>>>>>>>>>>> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> msg=0x262f628, msg_len=420) at totemrrp.c:1404 > >>>>>>>>>>>>>>> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value > >>>>>>>>>>>>>>> optimized out>, > >>>>>>>>>>>>>>> fd=<value optimized out>, revents=<value optimized out>, > >>>>>>>>>>>>>>> data=0x262ef80) > >>>>>>>>>>>>>>> at totemudp.c:1244 > >>>>>>>>>>>>>>> #9 0x00007f519616d7f2 in poll_run > >>>>>>>>>>>>>>> (handle=4858364909567606784) at > >>>>>>>>>>>>>>> coropoll.c:510 > >>>>>>>>>>>>>>> #10 0x0000000000406add in main (argc=<value optimized out>, > >>>>>>>>>>>>>>> argv=<value > >>>>>>>>>>>>>>> optimized out>, envp=<value optimized out>) at main.c:1680 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Last fplay lines are: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> rec=[36124] Log Message=Delivering MCAST message with seq > >>>>>>>>>>>>>>> 1366 to > >>>>>>>>>>>>>>> pending delivery queue > >>>>>>>>>>>>>>> rec=[36125] Log Message=Delivering MCAST message with seq > >>>>>>>>>>>>>>> 1367 to > >>>>>>>>>>>>>>> pending delivery queue > >>>>>>>>>>>>>>> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1366 > >>>>>>>>>>>>>>> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1367 > >>>>>>>>>>>>>>> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1366 > >>>>>>>>>>>>>>> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq > >>>>>>>>>>>>>>> 1367 > >>>>>>>>>>>>>>> rec=[36130] Log Message=releasing messages up to and > >>>>>>>>>>>>>>> including 1367 > >>>>>>>>>>>>>>> rec=[36131] Log Message=FAILED TO RECEIVE > >>>>>>>>>>>>>>> rec=[36132] Log Message=entering GATHER state from 6. > >>>>>>>>>>>>>>> rec=[36133] Log Message=entering GATHER state from 0. > >>>>>>>>>>>>>>> Finishing replay: records found [33993] > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What could be the reason for this? Bug, switches, memory > >>>>>>>>>>>>>>> errors? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The assertion fails because corosync finds out that > >>>>>>>>>>>>>> instance->my_proc_list and instance->my_failed_list are > >>>>>>>>>>>>>> equal. 
That happens immediately after the "FAILED TO RECEIVE" > >>>>>>>>>>>>>> message which is issued when fail_recv_const token rotations > >>>>>>>>>>>>>> have happened without any multicast packet being received (defaults to > >>>>>>>>>>>>>> 50). > >>>>>>>>>>>> > >>>>>>>>>>>> I took a look at the code and the protocol specification again > >>>>>>>>>>>> and it seems like that assert is not valid since Steve patched > >>>>>>>>>>>> the part dealing with the "FAILED TO RECEIVE" condition. The > >>>>>>>>>>>> patch is from 2010-06-03, posted to the list here: > >>>>>>>>>>>> http://marc.info/?l=openais&m=127559807608484&w=2 > >>>>>>>>>>>> > >>>>>>>>>>>> The last hunk of the patch contains this code (exec/totemsrp.c): > >>>>>>>>>>>> > >>>>>>>>>>>> 3933 if (memb_consensus_agreed (instance) && > >>>>>>>>>>>> instance->failed_to_recv == 1) { > >>>>>>>>>>>> 3934 instance->failed_to_recv = 0; > >>>>>>>>>>>> 3935 srp_addr_copy (&instance->my_proc_list[0], > >>>>>>>>>>>> 3936 &instance->my_id); > >>>>>>>>>>>> 3937 instance->my_proc_list_entries = 1; > >>>>>>>>>>>> 3938 instance->my_failed_list_entries = 0; > >>>>>>>>>>>> 3939 > >>>>>>>>>>>> 3940 memb_state_commit_token_create (instance); > >>>>>>>>>>>> 3941 > >>>>>>>>>>>> 3942 memb_state_commit_enter (instance); > >>>>>>>>>>>> 3943 return; > >>>>>>>>>>>> 3944 } > >>>>>>>>>>>> > >>>>>>>>>>>> This code never got a chance to run because on failed_to_recv > >>>>>>>>>>>> the two sets (my_process_list and my_failed_list) are equal, which > >>>>>>>>>>>> makes the assert fail in memb_consensus_agreed(): > >>>>>>>>>>>> > >>>>>>>>>>>> 1185 memb_set_subtract (token_memb, &token_memb_entries, > >>>>>>>>>>>> 1186 instance->my_proc_list, > >>>>>>>>>>>> instance->my_proc_list_entries, > >>>>>>>>>>>> 1187 instance->my_failed_list, > >>>>>>>>>>>> instance->my_failed_list_entries); > >>>>>>>>>>>> ... > >>>>>>>>>>>> 1195 assert (token_memb_entries >= 1); > >>>>>>>>>>>> > >>>>>>>>>>>> In other words, it's something like this: > >>>>>>>>>>>> > >>>>>>>>>>>> if A: > >>>>>>>>>>>> if memb_consensus_agreed() and failed_to_recv: > >>>>>>>>>>>> form a single node ring and try to recover > >>>>>>>>>>>> > >>>>>>>>>>>> memb_consensus_agreed(): > >>>>>>>>>>>> assert(!A) > >>>>>>>>>>>> > >>>>>>>>>>>> Steve, can you take a look and confirm that this holds? > >>>>>>>>>>>> > >>>>>>>>>>>> Cheers, > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Dejan, > >>>>>>>>>>> > >>>>>>>>>>> sorry for the delay in response - big backlog, which is mostly cleared > >>>>>>>>>>> out :) > >>>>>>>>>> > >>>>>>>>>> No problem. > >>>>>>>>>> > >>>>>>>>>>> The assert definitely isn't correct, but removing it without > >>>>>>>>>>> addressing > >>>>>>>>>>> the contents of the proc and fail lists is also not right. That > >>>>>>>>>>> would > >>>>>>>>>>> cause the logic in the if statement at line 3933 not to be > >>>>>>>>>>> executed > >>>>>>>>>>> (because the first part of the if would evaluate to false). > >>>>>>>>>> > >>>>>>>>>> Actually it wouldn't. The agreed variable is set to 1 and it > >>>>>>>>>> is going to be returned unchanged. > >>>>>>>>>> > >>>>>>>>>>> I believe > >>>>>>>>>>> what we should do is check the "failed_to_recv" value in > >>>>>>>>>>> memb_consensus_agreed instead of at line 3933. > >>>>>>>>>>> > >>>>>>>>>>> The issue with this is memb_state_consensus_timeout_expired, which > >>>>>>>>>>> also > >>>>>>>>>>> executes some 'then' logic where we may not want to execute the > >>>>>>>>>>> failed_to_recv logic.
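To make that direction concrete, below is a rough standalone model of checking failed_to_recv inside the consensus test so the assert is bypassed and the single-node recovery branch can run. All names and the simplified set handling are hypothetical; this is only a sketch of the idea under discussion, not the posted patch, and it deliberately ignores the memb_state_consensus_timeout_expired caller Steve mentions.

/*
 * Hypothetical standalone model (not corosync source) of checking
 * failed_to_recv inside the consensus test instead of at line 3933,
 * so the empty membership is tolerated and recovery can proceed.
 */
#include <assert.h>
#include <stdio.h>

struct model_instance {
    int failed_to_recv;    /* set after FAILED TO RECEIVE           */
    int proc_entries;      /* stands in for my_proc_list_entries    */
    int failed_entries;    /* stands in for my_failed_list_entries  */
};

/* simplified stand-in for memb_consensus_agreed() */
static int consensus_agreed(const struct model_instance *inst)
{
    /* with equal proc and failed lists the subtraction is empty */
    int token_memb_entries = inst->proc_entries - inst->failed_entries;

    if (inst->failed_to_recv == 1) {
        /* an empty result is expected here, so report agreement
         * instead of asserting */
        return 1;
    }

    assert(token_memb_entries >= 1);
    return 1;
}

/* simplified stand-in for the recovery branch in memb_join_process() */
static void join_process(struct model_instance *inst)
{
    if (consensus_agreed(inst) && inst->failed_to_recv == 1) {
        inst->failed_to_recv = 0;
        inst->proc_entries = 1;      /* proc list reduced to my_id */
        inst->failed_entries = 0;    /* failed list cleared        */
        printf("form single-node ring, enter commit\n");
    }
}

int main(void)
{
    /* the 12-node scenario: proc and failed lists of equal size */
    struct model_instance inst = {
        .failed_to_recv = 1, .proc_entries = 12, .failed_entries = 12
    };
    join_process(&inst);   /* no abort; the recovery branch runs */
    return 0;
}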
> >>>>>>>>>> > >>>>>>>>>> Perhaps we should just > >>>>>>>>>> > >>>>>>>>>> 3933 if (instance->failed_to_recv == 1) { > >>>>>>>>>> > >>>>>>>>>> ? In the failed_to_recv case both the proc and fail lists are equal, so > >>>>>>>>>> checking memb_consensus_agreed won't make sense, right? > >>>>>>>>>> > >>>>>>>>>>> If anyone has a reliable reproducer and can forward it to me, I'll > >>>>>>>>>>> test out > >>>>>>>>>>> a change to address this problem. Really hesitant to change > >>>>>>>>>>> anything in > >>>>>>>>>>> totemsrp without a test case for this problem - it's almost > >>>>>>>>>>> perfect ;-) > >>>>>>>>>> > >>>>>>>>>> Since the tester upgraded the switch firmware, they couldn't > >>>>>>>>>> reproduce it anymore. > >>>>>>>>>> > >>>>>>>>>> Would compiling with these help? > >>>>>>>>>> > >>>>>>>>>> /* > >>>>>>>>>> * These can be used to test the error recovery algorithms > >>>>>>>>>> * #define TEST_DROP_ORF_TOKEN_PERCENTAGE 30 > >>>>>>>>>> * #define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30 > >>>>>>>>>> * #define TEST_DROP_MCAST_PERCENTAGE 50 > >>>>>>>>>> * #define TEST_RECOVERY_MSG_COUNT 300 > >>>>>>>>>> */ > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> > >>>>>>>>>> Dejan > >>>>>>>>>> > >>>>>>>>>>> Regards > >>>>>>>>>>> -steve > >>>>>>>>>>> > >>>>>>>>>>>> Dejan _______________________________________________ Openais mailing list [email protected] https://lists.linux-foundation.org/mailman/listinfo/openais
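In case it helps with reproducing, one way to try those knobs is sketched below: uncomment (or add) the defines near the top of exec/totemsrp.c before rebuilding, using the values suggested in that source comment. The per-define descriptions are only name-based guesses, and whether these rates actually reproduce the FAILED TO RECEIVE assert is untested.

/*
 * Sketch only: enable the compile-time fault-injection knobs quoted
 * above, e.g. by uncommenting them in exec/totemsrp.c and rebuilding.
 * Values are the ones suggested in the source comment; the inline
 * descriptions are guesses based on the macro names.
 */
#define TEST_DROP_ORF_TOKEN_PERCENTAGE    30   /* drop ~30% of ORF tokens      */
#define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30   /* drop ~30% of commit tokens   */
#define TEST_DROP_MCAST_PERCENTAGE        50   /* drop ~50% of mcast packets   */
#define TEST_RECOVERY_MSG_COUNT           300  /* messages handled in recovery */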
