On 08/14/2011 01:34 PM, Tim Beale wrote:
> Hi Steve,
>
> I repeated the test with fail_recv_const=5000. I can see the CPG
> client hung for ~4 minutes without dispatching any CPG events (i.e.
> node join). Unfortunately, one of our healthchecking mechanisms kicked
> in at this point, detected the CPG client as locked up, and rebooted
> the units.
>
> It definitely rules out #2. I can repeat the test with healthchecking
> disabled to narrow down whether #1 or #3 will occur.
>
> Regards,
> Tim
>
> On Thu, Aug 11, 2011 at 4:21 AM, Steven Dake <[email protected]> wrote:
>> On 08/09/2011 09:56 PM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your patch.
>>>
>>> 1. I don't see the initial CLM leave events. But I still see the FAILED TO
>>> RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20,
>>> then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens.
>>> Attached is the latest debug.
>>>
>>
>> Keep in mind there are two problems here - (1) the clm membership is wrong
>> and (2) the fail to recv problem. They are independent issues.
>>
>> I definitely want to look into this failed to receive issue. Can you
>> try changing "fail_recv_const" on all the nodes to some large value,
>> such as 5000?
>>
>> One of 3 things should happen:
>> 1. the protocol blocks forever
>> 2. the protocol enters operational after some short period
>> 3. fail to recv is printed after a long period of time (1-10 minutes)
>>
>> Please report back which one happens with this tuning.
>>
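For reference, fail_recv_const lives in the totem { } section of
corosync.conf. A minimal sketch of the tuning suggested above - the
surrounding values are illustrative, not taken from this thread:

    totem {
        version: 2

        # Number of rotations of the token without receiving any of the
        # expected messages before the FAILED TO RECEIVE path declares a
        # new configuration. The default is 2500; raised here per the
        # suggestion above so the protocol's behaviour can be observed.
        fail_recv_const: 5000
    }
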
Given that #1 or #3 is basically what is occurring, I would love to have a
blackbox from a few seconds after config time and then another a couple of
minutes in. Apparently something is wrong with the recovery in this test
case.

Regards
-steve

>>
>>> I think the problem is some nodes end up missing a message/sequence
>>> number, although I'm not sure exactly why. E.g. the token sequence starts
>>> off at one when they enter operational, but not all nodes receive this.
>>> 2011 Aug 9 10:07:18 daemon.debug node-3 corosync[1575]: [TOTEM ]
>>> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
>>>
>>> The nodes that were still in recovery will be using different values for
>>> old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes
>>> receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
>>>
>>> The debug attached has your first memb-list patch popped off, but I've
>>> seen the same problem happen with it applied too.
>>>
>>> 2. Note that I don't see any CLM leave events at all now, even though
>>> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its ring.
>>> I think this is due to the logic:
>>> diff = my_new_memb_list - my_memb_list
>>
>> This isn't how the difference operation works. It produces a list of
>> nodes that are not both in my_new_memb_list and my_memb_list; therefore,
>> the current logic should be correct. I wrote the patch at 2am and was
>> quite tired, so I'll double-check it is correct.
>>
>> Regards
>> -steve
>>
>>> The diff doesn't include any nodes that are in my_memb_list but not in
>>> my_new_memb_list, i.e. left nodes. I guess you could get all the
>>> differences by doing the following:
>>> memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>>> memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>>> memb_set_and( diff1, diff2, diff )
>>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake <[email protected]> wrote:
>>>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>>>> Hi Steve,
>>>>>
>>>>> Thanks for your help. I tried out your patch but the problem still
>>>>> occurs. The problem looks to me to be due to the ring IDs used when
>>>>> forming the transitional memb-list, rather than the memb-list itself.
>>>>> The ring ID of the nodes still in Recovery is older than that of the
>>>>> rest of the nodes, which have already shifted to Operational.
>>>>>
>>>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>>>> nodes processing a Memb-Join immediately after shifting to
>>>>> Operational, until the token has rotated the ring once.
>>>>>
>>>>> It doesn't quite work either, though. The nodes are still re-entering
>>>>> gather before all have left recovery. This time it's due to processing
>>>>> a Merge-Detect message. One node has just started up and set itself as
>>>>> the rep, and sends out a Merge-Detect which triggers the other nodes
>>>>> to enter gather and reform the ring.
>>>>>
>>>>> Let me know if you have any other advice.
>>>>>
>>>>
>>>> The problem is clear from the blackbox - 8 nodes enter operational while
>>>> 1 node in recovery is interrupted by a join message. This interrupted
>>>> node then proceeds with a transitional membership of 1 node (which is
>>>> correct).
>>>>
>>>> The joined and left lists use the transitional list to determine their
>>>> contents, which is not correct. This results in incorrect data being
>>>> delivered to clm. Try the follow-up patch, which should correctly
>>>> calculate the joined and left lists.
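For clarity, the set algebra under discussion: the joined list should be
new-minus-old and the left list old-minus-new. Note also that combining the
two one-way differences into "all the differences" needs a union rather than
an intersection - the two subtraction results are disjoint by construction,
so AND-ing them, as in the pseudocode quoted above, always yields the empty
set. Below is a minimal C sketch, with a helper loosely modeled on
totemsrp.c's memb_set_subtract() but simplified; treat the names and
signatures as illustrative, not the actual corosync code.

/*
 * Sketch of the joined/left calculation discussed above. Illustrative
 * only - the helper signature is simplified from totemsrp.c.
 */
struct node {
	unsigned int nodeid;
};

/* out = a \ b: members of a that do not appear in b */
static void memb_set_subtract (struct node *out, int *out_entries,
	const struct node *a, int a_entries,
	const struct node *b, int b_entries)
{
	int i, j, found;

	*out_entries = 0;
	for (i = 0; i < a_entries; i++) {
		found = 0;
		for (j = 0; j < b_entries; j++) {
			if (a[i].nodeid == b[j].nodeid) {
				found = 1;
				break;
			}
		}
		if (found == 0) {
			out[(*out_entries)++] = a[i];
		}
	}
}

static void calc_joined_left (
	const struct node *new_memb, int new_entries,
	const struct node *old_memb, int old_entries,
	struct node *joined, int *joined_entries,
	struct node *left, int *left_entries)
{
	/* joined = new \ old: nodes present now that were not before */
	memb_set_subtract (joined, joined_entries,
		new_memb, new_entries, old_memb, old_entries);

	/* left = old \ new: nodes that were present before but are gone */
	memb_set_subtract (left, left_entries,
		old_memb, old_entries, new_memb, new_entries);

	/*
	 * "All the differences" (the symmetric difference) is the UNION
	 * of joined and left; an intersection of these two disjoint sets
	 * is always empty.
	 */
}

Computed this way from the full old and new membership (rather than from the
transitional list), CLM would see a leave event for exactly the nodes in
old-but-not-new and a join event for exactly the nodes in new-but-not-old.
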
>>>>
>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <[email protected]> wrote:
>>>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> It looks to me like, with the way the transition from Recovery to
>>>>>>> Operational works, we can't guarantee that all nodes in the ring have
>>>>>>> entered Operational before a node processes another Memb-Join message
>>>>>>> from a new node. E.g. we can't guarantee the token has rotated all the
>>>>>>> way around the ring.
>>>>>>>
>>>>>>> When this happens, the nodes still in Recovery will still use the
>>>>>>> older ring ID. So they won't get added to the transitional membership,
>>>>>>> and CLM will report leave events for these nodes. (Plus there might be
>>>>>>> other side-effects, like the FAILED TO RECEIVE problem - I haven't
>>>>>>> quite worked out why that's happening.)
>>>>>>>
>>>>>>
>>>>>> Thanks for the pointer here - patch on ml.
>>>>>>
>>>>>>> We are currently using CLM to check the health of a node, i.e. so we
>>>>>>> can detect if it locks up. My questions are:
>>>>>>> i) Are there config settings we could change to improve this, like
>>>>>>> increasing the 'join' timeout?
>>>>>>> ii) Should I try to make a code change to fix the problem? E.g. delay
>>>>>>> processing the Memb-Join message if the node has only just entered
>>>>>>> operational.
>>>>>>> iii) Should we not be using CLM like this? I.e. should we just learn
>>>>>>> to live with CLM/CPG sometimes reporting nodes as leaving when they're
>>>>>>> perfectly healthy?
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>> Tim
>>>>>>>
>>>>>>
>>>>>> Tim, please try the patch I have recently posted:
>>>>>> [PATCH] Set my_new_memb_list in recovery enter
>>>>>>
>>>>>> First and foremost, let me know if it resolves your 10-node startup
>>>>>> case which fails 10% of the time. Then let me know if it treats the
>>>>>> other symptoms.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>>>>>>> at roughly the same time) and approx 1 in 10 times we see some
>>>>>>>> problems:
>>>>>>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>>>>>>> (not sure if this is valid behaviour?)
>>>>>>>> b) Probably an unrelated oddity, but we're getting flow control
>>>>>>>> enabled on a client daemon using CLM that's only sending one request
>>>>>>>> (saClmClusterTrack()).
>>>>>>>> c) A node is hitting the FAILED TO RECEIVE case.
>>>>>>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>>>>>>> reform.
>>>>>>>> e) During the processing of node leave events, the CPG client can
>>>>>>>> sometimes get broken so it no longer processes *any* CPG events.
>>>>>>>>
>>>>>>>> Corosync debug is attached (I commented out some of the noisier debug
>>>>>>>> around message delivery). We don't really know enough about corosync
>>>>>>>> to tell what exactly is incorrect behaviour and what should be fixed.
>>>>>>>> But here's what we've noticed:
>>>>>>>> 1) Node-4 joins soon after node-1. When this happens, all nodes
>>>>>>>> except node-12 have entered operational state (see node-12.txt line
>>>>>>>> 235). It looks like maybe node-12 hasn't received enough rotations of
>>>>>>>> the token to enter operational yet.
>>>>>>>> Node-12's resulting transitional config consists of just itself. All
>>>>>>>> nodes then report node-1 and node-12 as leaving and immediately
>>>>>>>> rejoining.
>>>>>>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>>>>>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>>>>>>> have an ARU matching the high_seq_received; all other nodes have an
>>>>>>>> ARU of zero.
>>>>>>>> 3) Node-3 entering gather seems to result in a lot of config change
>>>>>>>> churn across the cluster.
>>>>>>>> 4) While processing the config changes on node-3, the CPG downlist it
>>>>>>>> uses contains itself. When node-3 sends leave events for the nodes in
>>>>>>>> the downlist (including itself), it sets its own cpd state to
>>>>>>>> CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no
>>>>>>>> longer sends any CPG events to the CPG client.
>>>>>>>>
>>>>>>>> We tried cherry-picking this commit to fix the problem (#4) with the
>>>>>>>> CPG client:
>>>>>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>>>>>> It helped a bit, but didn't fix it completely. We've made an interim
>>>>>>>> change (attached) to avoid this problem.
>>>>>>>>
>>>>>>>> We're using corosync v1.3.1 on an embedded Linux system (with a
>>>>>>>> low-spec CPU). Corosync is running over a basic ethernet interface
>>>>>>>> (no hubs/routers/etc).
>>>>>>>>
>>>>>>>> Any help would be appreciated. Let me know if there's any other debug
>>>>>>>> I can provide.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tim
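
For anyone wanting to reproduce the client-side symptom in problem (e) - a
CPG client that silently stops receiving events - the shape of such a client
is roughly the sketch below. This is a minimal illustration against the
public corosync CPG API; the group name and the bare-bones error handling
are placeholders, and it is not the actual daemon from this thread.

/*
 * Minimal CPG listener sketch: join a group and log membership changes.
 * Illustrative only - not the client daemon discussed in this thread.
 * Compile with: gcc cpg_listen.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static void deliver_fn (cpg_handle_t handle, const struct cpg_name *group,
	uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* Message delivery is ignored in this sketch. */
}

static void confchg_fn (cpg_handle_t handle, const struct cpg_name *group,
	const struct cpg_address *members, size_t member_entries,
	const struct cpg_address *left, size_t left_entries,
	const struct cpg_address *joined, size_t joined_entries)
{
	size_t i;

	for (i = 0; i < joined_entries; i++) {
		printf ("node %u joined\n", joined[i].nodeid);
	}
	for (i = 0; i < left_entries; i++) {
		printf ("node %u left (reason %u)\n",
			left[i].nodeid, left[i].reason);
	}
}

int main (void)
{
	cpg_handle_t handle;
	cpg_callbacks_t callbacks = {
		.cpg_deliver_fn = deliver_fn,
		.cpg_confchg_fn = confchg_fn,
	};
	struct cpg_name group;

	if (cpg_initialize (&handle, &callbacks) != CS_OK) {
		return 1;
	}

	strcpy (group.value, "test-group");	/* placeholder group name */
	group.length = strlen (group.value);
	if (cpg_join (handle, &group) != CS_OK) {
		return 1;
	}

	/*
	 * Block forever dispatching callbacks. If corosync stops delivering
	 * events (problem (e) above), this loop simply never fires the
	 * callbacks again, which looks identical to a quiet cluster.
	 */
	cpg_dispatch (handle, CS_DISPATCH_BLOCKING);

	cpg_finalize (handle);
	return 0;
}

A healthcheck like the one described earlier in the thread would watch that
this dispatch loop keeps making progress; without such an external liveness
signal, a wedged client is indistinguishable from an idle one.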
