On 08/09/2011 09:56 PM, Tim Beale wrote:
> Hi Steve,
>
> Thanks for your patch.
>
> 1. I don't see the initial CLM leave events, but I still see the FAILED
> TO RECEIVE hit on node-3. A couple of nodes don't enter operational on
> ring 20, then after the ring next reforms (ring 24), the FAILED TO
> RECEIVE happens. Attached is the latest debug.

Keep in mind there are two problems here: (1) the CLM membership is wrong,
and (2) the fail-to-recv problem. They are independent issues.

I definitely want to look into this FAILED TO RECEIVE issue. Can you try
changing "fail_recv_const" on all the nodes to some large value, such as
5000? One of three things should happen:

1. the protocol blocks forever
2. the protocol enters operational after some short period
3. FAILED TO RECEIVE is printed after a long period of time (1-10 minutes)

Please report back which one happens with this tuning.
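For clarity, that tuning goes in the totem stanza of corosync.conf on
every node; something like the following (the interface values are just
placeholders for your setup - only fail_recv_const matters here):

    totem {
            version: 2

            # How many rotations of the token without receiving expected
            # messages may occur before a new configuration is formed,
            # i.e. before FAILED TO RECEIVE is declared (default: 2500).
            fail_recv_const: 5000

            interface {
                    ringnumber: 0
                    bindnetaddr: 192.168.1.0
                    mcastaddr: 226.94.1.1
                    mcastport: 5405
            }
    }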
> I think the problem is some nodes end up missing a message/sequence
> number, although I'm not sure exactly why. E.g. the token sequence
> starts off at one when they enter operational, but not all nodes receive
> this.
>
> 2011 Aug 9 10:07:18 daemon.debug node-3 corosync[1575]: [TOTEM ]
> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
>
> The nodes that were still in recovery will be using different values for
> old_ring_state_high_seq_received and my_old_ring_id. It seems these
> nodes receive msg seq #1, but the others don't and hit the FAILED TO
> RECEIVE.
>
> The debug attached has your first memb-list patch popped off, but I've
> seen the same problem happen with it applied too.
>
> 2. Note that I don't see any CLM leave events at all now, even though
> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its
> ring. I think this is due to the logic:
>
>     diff = my_new_memb_list - my_memb_list

That isn't how the difference operation works. It produces a list of the
nodes that are not in both my_new_memb_list and my_memb_list, so the
current logic should be correct. That said, I wrote the patch at 2am and
was quite tired, so I'll double-check it.
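To make the set algebra concrete, here is a small standalone sketch. It
uses plain int node ids and an illustrative set_subtract() helper rather
than the real srp_addr lists and memb_set_* functions in totemsrp.c, so
treat it as pseudocode for the semantics being discussed:

    #include <stdio.h>

    /* One-way difference: out = a - b (members of a that are not in b). */
    static int set_subtract(const int *a, int a_n,
                            const int *b, int b_n, int *out)
    {
            int n = 0;

            for (int i = 0; i < a_n; i++) {
                    int found = 0;

                    for (int j = 0; j < b_n; j++) {
                            if (a[i] == b[j]) {
                                    found = 1;
                                    break;
                            }
                    }
                    if (!found)
                            out[n++] = a[i];
            }
            return n;
    }

    int main(void)
    {
            int my_memb[]     = { 1, 2, 3, 4 }; /* previous membership */
            int my_new_memb[] = { 2, 3, 4, 5 }; /* new membership */
            int joined[4], left[4];

            /*
             * joined = new - old, left = old - new. Together these two
             * one-way differences are exactly the set of nodes that are
             * not in both lists.
             */
            int joined_n = set_subtract(my_new_memb, 4, my_memb, 4, joined);
            int left_n   = set_subtract(my_memb, 4, my_new_memb, 4, left);

            printf("joined:");
            for (int i = 0; i < joined_n; i++)
                    printf(" %d", joined[i]);
            printf("\nleft:");
            for (int i = 0; i < left_n; i++)
                    printf(" %d", left[i]);
            printf("\n");

            return 0; /* prints "joined: 5" and "left: 1" */
    }

Running it prints "joined: 5" and "left: 1": the two one-way differences
together cover every node that changed between the lists.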
Regards
-steve

> The diff doesn't include any nodes that are in my_memb_list but not in
> my_new_memb_list, i.e. left nodes. I guess you could get all the
> differences by doing the following:
>
>     memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>     memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>     memb_set_and( diff1, diff2, diff )
>
> Thanks,
> Tim
>
> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake <[email protected]> wrote:
>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your help. I tried out your patch but the problem still
>>> occurs. The problem looks to me to be due to the ring IDs used when
>>> forming the transitional memb-list, rather than the memb-list itself.
>>> The ring ID of the nodes still in Recovery is older than that of the
>>> rest of the nodes, which have already shifted to Operational.
>>>
>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>> nodes processing a Memb-Join immediately after shifting to
>>> Operational, until the token has rotated the ring once.
>>>
>>> It doesn't quite work either, though. The nodes still re-enter gather
>>> before all have left recovery. This time it's due to processing a
>>> Merge-Detect message. One node has just started up and set itself to
>>> the rep, and sends out a Merge-Detect which triggers the other nodes
>>> to enter gather and reform the ring.
>>>
>>> Let me know if you have any other advice.
>>
>> The problem is clear from the blackbox: 8 nodes enter operational while
>> the 1 node still in recovery is interrupted by a join message. This
>> interrupted node then proceeds with a transitional membership of 1 node
>> (which is correct).
>>
>> The joined and left lists use the transitional list to determine their
>> contents, which is not correct. This results in incorrect data
>> delivered to clm. Try the follow-up patch, which should correctly
>> calculate the joined and left lists.
>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <[email protected]> wrote:
>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>> Hi,
>>>>>
>>>>> It looks to me that, the way the transition from Recovery to
>>>>> Operational works, we can't guarantee that all nodes in the ring
>>>>> have entered Operational before a node processes another Memb-Join
>>>>> message from a new node. E.g. we can't guarantee the token has
>>>>> rotated all the way around the ring.
>>>>>
>>>>> When this happens, the nodes still in Recovery will still use the
>>>>> older ring ID. So they won't get added to the transitional
>>>>> membership, and CLM will report leave events for these nodes. (Plus
>>>>> there might be other side-effects, like the FAILED TO RECEIVE
>>>>> problem - I haven't quite worked out why that's happening.)
>>>>
>>>> Thanks for the pointer here - patch on ml.
>>>>
>>>>> We are currently using CLM to check the health of a node, i.e. so we
>>>>> can detect if it locks up. My questions are:
>>>>> i) Are there config settings we could change to improve this, like
>>>>> increasing the 'join' timeout?
>>>>> ii) Should I try to make a code change to fix the problem? E.g.
>>>>> delay processing the Memb-Join message if the node has only just
>>>>> entered operational.
>>>>> iii) Should we not be using CLM like this? I.e. should we just learn
>>>>> to live with CLM/CPG sometimes reporting nodes as leaving when
>>>>> they're perfectly healthy?
>>>>>
>>>>> Thanks for your help.
>>>>> Tim
>>>>
>>>> Tim, please try the patch I recently posted:
>>>>
>>>>     [PATCH] Set my_new_memb_list in recovery enter
>>>>
>>>> First and foremost, let me know if it resolves your 10-node startup
>>>> case which fails 10% of the time. Then let me know if it treats other
>>>> symptoms.
>>>>
>>>> Regards
>>>> -steve
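As a side note on (iii): the health tracking described there boils down to
a CLM track subscription along these lines. This is a rough outline
against the SA Forum CLM B.01.01 API, with error handling trimmed and an
illustrative callback name, not the actual client code from this thread:

    #include <stdio.h>
    #include <saClm.h>

    /* Invoked on every cluster membership change we subscribed to. */
    static void track_cb(const SaClmClusterNotificationBufferT *buf,
                         SaUint32T n_members, SaAisErrorT error)
    {
            (void)n_members; (void)error; /* unused in this sketch */

            for (SaUint32T i = 0; i < buf->numberOfItems; i++) {
                    const SaClmClusterNotificationT *n =
                            &buf->notification[i];

                    if (n->clusterChange == SA_CLM_NODE_LEFT)
                            printf("node %u left\n",
                                   (unsigned int)n->clusterNode.nodeId);
            }
    }

    int main(void)
    {
            SaClmHandleT handle;
            SaVersionT version = { 'B', 1, 1 };
            SaClmCallbacksT callbacks = {
                    .saClmClusterNodeGetCallback = NULL,
                    .saClmClusterTrackCallback   = track_cb,
            };

            if (saClmInitialize(&handle, &callbacks, &version) != SA_AIS_OK)
                    return 1;

            /* Subscribe to membership changes; join/leave events arrive
             * in track_cb via saClmDispatch(). */
            saClmClusterTrack(handle, SA_TRACK_CHANGES, NULL);

            for (;;)
                    saClmDispatch(handle, SA_DISPATCH_BLOCKING);
    }

The catch, as this thread shows, is that a transitional configuration of
one node makes totem report healthy peers as having left, so a leave event
seen here does not always mean the remote node actually locked up.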

>>>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>>>>> at roughly the same time) and approx 1 in 10 times we see some
>>>>>> problems:
>>>>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>>>>> (not sure if this is valid behaviour?)
>>>>>> b) Probably an unrelated oddity, but we're getting flow control
>>>>>> enabled on a client daemon using CLM that's only sending one request
>>>>>> (saClmClusterTrack()).
>>>>>> c) A node is hitting the FAILED TO RECEIVE case.
>>>>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>>>>> reform.
>>>>>> e) During the processing of node leave events, the CPG client can
>>>>>> sometimes get broken so that it no longer processes *any* CPG
>>>>>> events.
>>>>>>
>>>>>> Corosync debug is attached (I commented out some of the noisier
>>>>>> debug around message delivery). We don't really know enough about
>>>>>> corosync to tell what exactly is incorrect behaviour and what should
>>>>>> be fixed. But here's what we've noticed:
>>>>>> 1) Node-4 joins soon after node-1. When this happens, all nodes
>>>>>> except node-12 have entered operational state (see node-12.txt line
>>>>>> 235). It looks like maybe node-12 hasn't received enough rotations
>>>>>> of the token to enter operational yet. Node-12's resulting
>>>>>> transitional config consists of just itself. All nodes then report
>>>>>> node-1 and node-12 as leaving and immediately rejoining.
>>>>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>>>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>>>>> have an ARU matching the high_seq_received; all other nodes have an
>>>>>> ARU of zero.
>>>>>> 3) Node-3 entering gather seems to result in a lot of config change
>>>>>> churn across the cluster.
>>>>>> 4) While processing the config changes on node-3, the CPG downlist
>>>>>> it uses contains itself. When node-3 sends leave events for the
>>>>>> nodes in the downlist (including itself), it sets its own cpd state
>>>>>> to CPD_STATE_UNJOINED and clears the cpd->group_name. This means it
>>>>>> no longer sends any CPG events to the CPG client.
>>>>>>
>>>>>> We tried cherry-picking this commit to fix problem 4) with the CPG
>>>>>> client:
>>>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>>>> It helped a bit, but didn't fix it completely. We've made an interim
>>>>>> change (attached) to avoid this problem.
>>>>>>
>>>>>> We're using corosync v1.3.1 on an embedded linux system (with a
>>>>>> low-spec CPU). Corosync is running over a basic ethernet interface
>>>>>> (no hubs/routers/etc).
>>>>>>
>>>>>> Any help would be appreciated. Let me know if there's any other
>>>>>> debug I can provide.
>>>>>>
>>>>>> Thanks,
>>>>>> Tim
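One more note on problem 4): the interim change presumably needs to stop a
node from unjoining itself when its own id shows up in the downlist.
Purely as an illustration - downlist_deliver() and its parameters are made
up, not the actual corosync cpg.c symbols - the guard would look something
like:

    #include <stddef.h>

    /*
     * Sketch of a downlist guard: when delivering leave events for a new
     * configuration, never treat our own node id as having left, even if
     * it appears in the downlist that was selected.
     */
    static void downlist_deliver(const unsigned int *downlist,
                                 size_t downlist_entries,
                                 unsigned int my_nodeid)
    {
            for (size_t i = 0; i < downlist_entries; i++) {
                    if (downlist[i] == my_nodeid) {
                            /*
                             * A downlist containing ourselves means we
                             * were partitioned out. Skipping the entry
                             * avoids setting our own cpd to
                             * CPD_STATE_UNJOINED and clearing
                             * cpd->group_name, which is what stops the
                             * local client receiving any further CPG
                             * events.
                             */
                            continue;
                    }
                    /* ... deliver a CPG leave event for downlist[i] ... */
            }
    }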
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais