Hi Steve,

I repeated the test with fail_recv_const=5000. I could see the CPG client hang for ~4 minutes without dispatching any CPG events (e.g. node joins). Unfortunately, one of our healthchecking mechanisms kicked in at that point, detected the CPG client as locked up, and rebooted the units.
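For reference, fail_recv_const lives in the totem section of corosync.conf. A minimal sketch of the change (the only value touched for this test is fail_recv_const; the rest of our totem settings are unchanged and omitted here):

    totem {
        version: 2

        # Number of token rotations allowed without receiving a message that
        # should have been received, before FAILED TO RECEIVE is declared and
        # a new configuration is formed (corosync's documented default is 2500).
        fail_recv_const: 5000

        # ... existing interface/timeout settings unchanged ...
    }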
It definitely rules out #2. I can repeat the test with healthchecking disabled to narrow down whether #1 or #3 occurs.

Regards,
Tim

On Thu, Aug 11, 2011 at 4:21 AM, Steven Dake <[email protected]> wrote:
> On 08/09/2011 09:56 PM, Tim Beale wrote:
>> Hi Steve,
>>
>> Thanks for your patch.
>>
>> 1. I don't see the initial CLM leave events, but I still see the FAILED TO RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20, then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens. Attached is the latest debug.
>>
>
> Keep in mind there are two problems here - (1) the CLM membership is wrong and (2) the fail to recv problem. They are independent issues.
>
> I definitely want to look into this failed to receive issue. Can you try changing "fail_recv_const" on all the nodes to some large value, such as 5000?
>
> One of 3 things should happen:
> 1. the protocol blocks forever
> 2. the protocol enters operational after some short period
> 3. fail to recv is printed after a long period of time (1-10 minutes).
>
> Please report back which one happens with this tuning.
>
>> I think the problem is that some nodes end up missing a message/sequence number, although I'm not sure exactly why. E.g. the token sequence starts off at one when they enter operational, but not all nodes receive this.
>> 2011 Aug 9 10:07:18 daemon.debug node-3 corosync[1575]: [TOTEM ] totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
>>
>> The nodes that were still in recovery will be using different values for old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
>>
>> The debug attached has your first memb-list patch popped off, but I've seen the same problem happen with it applied too.
>>
>> 2. Note that I don't see any CLM leave events at all now, even though after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its ring. I think this is due to the logic:
>> diff = my_new_memb_list - my_memb_list
>
> This isn't how the difference operation works. It produces a list of nodes that are not in both my_new_memb_list and my_memb_list, therefore the current 'and' logic should be correct. I wrote the patch at 2am and was quite tired, so I'll double check it is correct.
>
> Regards
> -steve
>
>> The diff doesn't include any nodes that are in my_memb_list but not in my_new_memb_list, i.e. left nodes. I guess you could get all the differences by doing the following:
>> memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>> memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>> memb_set_and( diff1, diff2, diff )
>>
>> Thanks,
>> Tim
>>
>> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake <[email protected]> wrote:
>>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>>> Hi Steve,
>>>>
>>>> Thanks for your help. I tried out your patch but the problem still occurs. The problem looks to me to be due to the ring IDs used when forming the transitional memb-list, rather than the memb-list itself. The ring ID of the nodes still in Recovery is older than that of the rest of the nodes, who have already shifted to Operational.
>>>>
>>>> Attached is my attempt at fixing the problem. The idea is to delay the nodes processing a Memb-Join immediately after shifting to Operational, until the token has rotated the ring once.
>>>>
>>>> It doesn't quite work either though.
>>>> The nodes are still re-entering gather before all have left recovery. This time it's due to processing a Merge-Detect message. One node has just started up and set itself as the rep, and sends out a Merge-Detect which triggers the other nodes to enter gather and reform the ring.
>>>>
>>>> Let me know if you have any other advice.
>>>>
>>>
>>> The problem is clear from the blackbox - 8 nodes enter operational while 1 node in recovery is interrupted by a join message. This interrupted node then proceeds with a transitional membership of 1 node (which is correct).
>>>
>>> The joined and left lists use the transitional list to determine their contents, which is not correct. This results in incorrect data delivered to CLM. Try the follow-up patch, which should correctly calculate the joined and left lists.
>>>
>>>> Thanks,
>>>> Tim
>>>>
>>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <[email protected]> wrote:
>>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>>> Hi,
>>>>>>
>>>>>> It looks to me that, given the way the transition from Recovery to Operational works, we can't guarantee that all nodes in the ring have entered Operational before a node processes another Memb-Join message from a new node. E.g. we can't guarantee the token has rotated all the way around the ring.
>>>>>>
>>>>>> When this happens, the nodes still in Recovery will still use the older ring ID. So they won't get added to the transitional membership, and CLM will report leave events for these nodes. (Plus there might be other side-effects, like the FAILED TO RECEIVE problem - I haven't quite worked out why that's happening.)
>>>>>>
>>>>>
>>>>> Thanks for the pointer here - patch on ml.
>>>>>
>>>>>> We are currently using CLM to check the health of a node, i.e. so we can detect if it locks up. My questions are:
>>>>>> i) Are there config settings we could change to improve this, like increasing the 'join' timeout?
>>>>>> ii) Should I try to make a code change to fix the problem? E.g. delay processing the Memb-Join message if the node has only just entered operational.
>>>>>> iii) Should we not be using CLM like this? I.e. should we just learn to live with CLM/CPG sometimes reporting nodes as leaving when they're perfectly healthy?
>>>>>>
>>>>>> Thanks for your help.
>>>>>> Tim
>>>>>>
>>>>>
>>>>> Tim, please try the patch I have recently posted:
>>>>> [PATCH] Set my_new_memb_list in recovery enter
>>>>>
>>>>> First and foremost, let me know if it resolves your 10 node startup case which fails 10% of the time. Then let me know if it treats other symptoms.
>>>>>
>>>>> Regards
>>>>> -steve
>>>>>
>>>>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're booting up a 10-node cluster (with all nodes starting corosync at roughly the same time) and approx 1 in 10 times we see some problems:
>>>>>>> a) CLM is reporting nodes as leaving and then immediately rejoining (not sure if this is valid behaviour?)
>>>>>>> b) Probably an unrelated oddity, but we're getting flow control enabled on a client daemon using CLM that's only sending one request (saClmClusterTrack()).
>>>>>>> c) A node is hitting the FAILED TO RECEIVE case
>>>>>>> d) After c) there seems to be a lot of churn as the cluster tries to reform
>>>>>>> e) During the processing of node leave events, the CPG client can sometimes get broken so it no longer processes *any* CPG events
>>>>>>>
>>>>>>> Corosync debug is attached (I commented out some of the noisier debug around message delivery). We don't really know enough about corosync to tell what exactly is incorrect behaviour and what should be fixed. But here's what we've noticed:
>>>>>>> 1) Node-4 joins soon after node-1. When this happens, all nodes except node-12 have entered operational state (see node-12.txt line 235). It looks like maybe node-12 hasn't received enough rotations of the token to enter operational yet. Node-12's resulting transitional config consists of just itself. All nodes then report node-1 and node-12 as leaving and immediately rejoining.
>>>>>>> 2) After this config change, node-3 eventually hits the FAILED TO RECEIVE case (node-3.txt line 380). At this point node-1 and node-12 have an ARU matching the high_seq_received; all other nodes have an ARU of zero.
>>>>>>> 3) Node-3 entering gather seems to result in a lot of config change churn across the cluster.
>>>>>>> 4) While processing the config changes on node-3, the CPG downlist it uses contains itself. When node-3 sends leave events for the nodes in the downlist (including itself), it sets its own cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no longer sends any CPG events to the CPG client.
>>>>>>>
>>>>>>> We tried cherry-picking this commit to fix the problem (#4) with the CPG client:
>>>>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>>>>> It helped a bit, but didn't fix it completely. We've made an interim change (attached) to avoid this problem.
>>>>>>>
>>>>>>> We're using corosync v1.3.1 on an embedded linux system (with a low-spec CPU). Corosync is running over a basic ethernet interface (no hubs/routers/etc).
>>>>>>>
>>>>>>> Any help would be appreciated. Let me know if there's any other debug I can provide.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tim
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Openais mailing list
>>>>>> [email protected]
>>>>>> https://lists.linux-foundation.org/mailman/listinfo/openais
>>>>>
>>>
>
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
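For reference, the joined/left calculation sketched in the thread above boils down to two set subtractions: joined = new membership minus old membership, left = old membership minus new membership (and combining the two into one "changed" set would be a union/merge rather than an 'and'). A small standalone illustration in plain C with generic stand-in types follows - it is not the actual totemsrp memb_set_* code, just the idea:

    /*
     * Illustrative sketch only: computing "joined" and "left" node lists
     * from an old and a new membership.  Types and helper names here are
     * generic stand-ins, not the totemsrp memb_set_* helpers.
     */
    #include <stdio.h>

    #define MAX_NODES 16

    /* out = a \ b: every element of a that does not appear in b */
    static int set_subtract(const int *a, int a_len,
                            const int *b, int b_len, int *out)
    {
        int out_len = 0;

        for (int i = 0; i < a_len; i++) {
            int found = 0;
            for (int j = 0; j < b_len; j++) {
                if (a[i] == b[j]) {
                    found = 1;
                    break;
                }
            }
            if (!found)
                out[out_len++] = a[i];
        }
        return out_len;
    }

    int main(void)
    {
        /* hypothetical node IDs before and after a configuration change */
        int old_memb[] = { 1, 3, 4, 12 };
        int new_memb[] = { 1, 3, 4, 5 };

        int joined[MAX_NODES], left[MAX_NODES];

        /* joined = new \ old, left = old \ new */
        int joined_len = set_subtract(new_memb, 4, old_memb, 4, joined);
        int left_len   = set_subtract(old_memb, 4, new_memb, 4, left);

        printf("joined:");
        for (int i = 0; i < joined_len; i++)
            printf(" %d", joined[i]);
        printf("\nleft:");
        for (int i = 0; i < left_len; i++)
            printf(" %d", left[i]);
        printf("\n");

        return 0;
    }

With the example memberships above this prints "joined: 5" and "left: 12", which is the split into join and leave events that CLM/CPG would be expected to report.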
