On 08/14/2011 01:34 PM, Tim Beale wrote:
> Hi Steve,
>
> I repeated the test with fail_recv_const=5000. I can see the CPG
> client hung for ~4 minutes without dispatching any CPG events (i.e.
> node join). Unfortunately, one of our healthchecking mechanisms kicked
> in at this point, detected the CPG client as locked up, and rebooted
> the units.
>
> It definitely rules out #2. I can repeat the test with healthchecking
> disabled to narrow down whether #1 or #3 will occur.
>
> Regards,
> Tim
>
> On Thu, Aug 11, 2011 at 4:21 AM, Steven Dake <[email protected]> wrote:
>> On 08/09/2011 09:56 PM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your patch.
>>>
>>> 1. I don't see the initial CLM leave events. But I still see the FAILED TO
>>> RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20,
>>> then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens.
>>> Attached is the latest debug.
>>>
>>
>> Keep in mind there are two problems here - (1) the clm membership is wrong
>> and (2) the fail to recv problem. They are independent issues.
>>
>> I definitely want to look into this failed to receive issue. Can you
>> try changing "fail_recv_const" on all the nodes to some large value,
>> such as 5000?
>>
>> One of 3 things should happen:
>> 1. the protocol blocks forever
>> 2. the protocol enters operational after some short period
>> 3. fail to recv is printed after a long period of time (1-10 minutes)
>>
>> Please report back which one happens with this tuning.
>>
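For reference, fail_recv_const lives in the totem { } section of
corosync.conf. A minimal sketch of the tuning suggested above - the
surrounding values are illustrative, not taken from this thread:

    totem {
        version: 2

        # Number of rotations of the token without receiving any of the
        # expected messages before the FAILED TO RECEIVE path declares a
        # new configuration. The default is 2500; raised here per the
        # suggestion above so the protocol's behaviour can be observed.
        fail_recv_const: 5000
    }
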
Given that #1 or #3 is basically what is occurring, I would love to have a
blackbox from a few seconds after config time and then another a couple of
minutes in. Apparently something is wrong with the recovery in this test
case.

Regards
-steve

>>
>>> I think the problem is some nodes end up missing a message/sequence
>>> number, although I'm not sure exactly why. E.g. the token sequence starts
>>> off at one when they enter operational, but not all nodes receive this.
>>> 2011 Aug 9 10:07:18 daemon.debug node-3 corosync[1575]: [TOTEM ]
>>> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
>>>
>>> The nodes that were still in recovery will be using different values for
>>> old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes
>>> receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
>>>
>>> The debug attached has your first memb-list patch popped off, but I've
>>> seen the same problem happen with it applied too.
>>>
>>> 2. Note that I don't see any CLM leave events at all now, even though
>>> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its ring.
>>> I think this is due to the logic:
>>> diff = my_new_memb_list - my_memb_list
>>
>> This isn't how the difference operation works. It produces a list of
>> nodes that are not both in my_new_memb_list and my_memb_list; therefore,
>> the current logic should be correct. I wrote the patch at 2am and was
>> quite tired, so I'll double-check it is correct.
>>
>> Regards
>> -steve
>>
>>> The diff doesn't include any nodes that are in my_memb_list but not in
>>> my_new_memb_list, i.e. left nodes. I guess you could get all the
>>> differences by doing the following:
>>> memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>>> memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>>> memb_set_and( diff1, diff2, diff )
>>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake <[email protected]> wrote:
>>>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>>>> Hi Steve,
>>>>>
>>>>> Thanks for your help. I tried out your patch but the problem still
>>>>> occurs. The problem looks to me to be due to the ring IDs used when
>>>>> forming the transitional memb-list, rather than the memb-list itself.
>>>>> The ring ID of the nodes still in Recovery is older than that of the
>>>>> rest of the nodes, which have already shifted to Operational.
>>>>>
>>>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>>>> nodes processing a Memb-Join immediately after shifting to
>>>>> Operational, until the token has rotated the ring once.
>>>>>
>>>>> It doesn't quite work either, though. The nodes are still re-entering
>>>>> gather before all have left recovery. This time it's due to processing
>>>>> a Merge-Detect message. One node has just started up and set itself as
>>>>> the rep, and sends out a Merge-Detect which triggers the other nodes
>>>>> to enter gather and reform the ring.
>>>>>
>>>>> Let me know if you have any other advice.
>>>>>
>>>>
>>>> The problem is clear from the blackbox - 8 nodes enter operational while
>>>> 1 node in recovery is interrupted by a join message. This interrupted
>>>> node then proceeds with a transitional membership of 1 node (which is
>>>> correct).
>>>>
>>>> The joined and left lists use the transitional list to determine their
>>>> contents, which is not correct. This results in incorrect data being
>>>> delivered to clm. Try the follow-up patch, which should correctly
>>>> calculate the joined and left lists.
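For clarity, the set algebra under discussion: the joined list should be
new-minus-old and the left list old-minus-new. Note also that combining the
two one-way differences into "all the differences" needs a union rather than
an intersection - the two subtraction results are disjoint by construction,
so AND-ing them, as in the pseudocode quoted above, always yields the empty
set. Below is a minimal C sketch, with a helper loosely modeled on
totemsrp.c's memb_set_subtract() but simplified; treat the names and
signatures as illustrative, not the actual corosync code.

/*
 * Sketch of the joined/left calculation discussed above. Illustrative
 * only - the helper signature is simplified from totemsrp.c.
 */
struct node {
	unsigned int nodeid;
};

/* out = a \ b: members of a that do not appear in b */
static void memb_set_subtract (struct node *out, int *out_entries,
	const struct node *a, int a_entries,
	const struct node *b, int b_entries)
{
	int i, j, found;

	*out_entries = 0;
	for (i = 0; i < a_entries; i++) {
		found = 0;
		for (j = 0; j < b_entries; j++) {
			if (a[i].nodeid == b[j].nodeid) {
				found = 1;
				break;
			}
		}
		if (found == 0) {
			out[(*out_entries)++] = a[i];
		}
	}
}

static void calc_joined_left (
	const struct node *new_memb, int new_entries,
	const struct node *old_memb, int old_entries,
	struct node *joined, int *joined_entries,
	struct node *left, int *left_entries)
{
	/* joined = new \ old: nodes present now that were not before */
	memb_set_subtract (joined, joined_entries,
		new_memb, new_entries, old_memb, old_entries);

	/* left = old \ new: nodes that were present before but are gone */
	memb_set_subtract (left, left_entries,
		old_memb, old_entries, new_memb, new_entries);

	/*
	 * "All the differences" (the symmetric difference) is the UNION
	 * of joined and left; an intersection of these two disjoint sets
	 * is always empty.
	 */
}

Computed this way from the full old and new membership (rather than from the
transitional list), CLM would see a leave event for exactly the nodes in
old-but-not-new and a join event for exactly the nodes in new-but-not-old.
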
>>>>
>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <[email protected]> wrote:
>>>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> It looks to me like, with the way the transition from Recovery to
>>>>>>> Operational works, we can't guarantee that all nodes in the ring have
>>>>>>> entered Operational before a node processes another Memb-Join message
>>>>>>> from a new node. E.g. we can't guarantee the token has rotated all the
>>>>>>> way around the ring.
>>>>>>>
>>>>>>> When this happens, the nodes still in Recovery will still use the
>>>>>>> older ring ID. So they won't get added to the transitional membership,
>>>>>>> and CLM will report leave events for these nodes. (Plus there might be
>>>>>>> other side-effects, like the FAILED TO RECEIVE problem - I haven't
>>>>>>> quite worked out why that's happening.)
>>>>>>>
>>>>>>
>>>>>> Thanks for the pointer here - patch on ml.
>>>>>>
>>>>>>> We are currently using CLM to check the health of a node, i.e. so we
>>>>>>> can detect if it locks up. My questions are:
>>>>>>> i) Are there config settings we could change to improve this, like
>>>>>>> increasing the 'join' timeout?
>>>>>>> ii) Should I try to make a code change to fix the problem? E.g. delay
>>>>>>> processing the Memb-Join message if the node has only just entered
>>>>>>> operational.
>>>>>>> iii) Should we not be using CLM like this? I.e. should we just learn
>>>>>>> to live with CLM/CPG sometimes reporting nodes as leaving when they're
>>>>>>> perfectly healthy?
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>> Tim
>>>>>>>
>>>>>>
>>>>>> Tim, please try the patch I have recently posted:
>>>>>> [PATCH] Set my_new_memb_list in recovery enter
>>>>>>
>>>>>> First and foremost, let me know if it resolves your 10-node startup
>>>>>> case which fails 10% of the time. Then let me know if it treats the
>>>>>> other symptoms.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>>>>>>> at roughly the same time) and approx 1 in 10 times we see some
>>>>>>>> problems:
>>>>>>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>>>>>>> (not sure if this is valid behaviour?)
>>>>>>>> b) Probably an unrelated oddity, but we're getting flow control
>>>>>>>> enabled on a client daemon using CLM that's only sending one request
>>>>>>>> (saClmClusterTrack()).
>>>>>>>> c) A node is hitting the FAILED TO RECEIVE case.
>>>>>>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>>>>>>> reform.
>>>>>>>> e) During the processing of node leave events, the CPG client can
>>>>>>>> sometimes get broken so it no longer processes *any* CPG events.
>>>>>>>>
>>>>>>>> Corosync debug is attached (I commented out some of the noisier debug
>>>>>>>> around message delivery). We don't really know enough about corosync
>>>>>>>> to tell what exactly is incorrect behaviour and what should be fixed.
>>>>>>>> But here's what we've noticed:
>>>>>>>> 1) Node-4 joins soon after node-1. When this happens, all nodes
>>>>>>>> except node-12 have entered operational state (see node-12.txt line
>>>>>>>> 235). It looks like maybe node-12 hasn't received enough rotations of
>>>>>>>> the token to enter operational yet.
>>>>>>>> Node-12's resulting transitional config consists of just itself. All
>>>>>>>> nodes then report node-1 and node-12 as leaving and immediately
>>>>>>>> rejoining.
>>>>>>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>>>>>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>>>>>>> have an ARU matching the high_seq_received; all other nodes have an
>>>>>>>> ARU of zero.
>>>>>>>> 3) Node-3 entering gather seems to result in a lot of config change
>>>>>>>> churn across the cluster.
>>>>>>>> 4) While processing the config changes on node-3, the CPG downlist it
>>>>>>>> uses contains itself. When node-3 sends leave events for the nodes in
>>>>>>>> the downlist (including itself), it sets its own cpd state to
>>>>>>>> CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no
>>>>>>>> longer sends any CPG events to the CPG client.
>>>>>>>>
>>>>>>>> We tried cherry-picking this commit to fix the problem (#4) with the
>>>>>>>> CPG client:
>>>>>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>>>>>> It helped a bit, but didn't fix it completely. We've made an interim
>>>>>>>> change (attached) to avoid this problem.
>>>>>>>>
>>>>>>>> We're using corosync v1.3.1 on an embedded Linux system (with a
>>>>>>>> low-spec CPU). Corosync is running over a basic ethernet interface
>>>>>>>> (no hubs/routers/etc).
>>>>>>>>
>>>>>>>> Any help would be appreciated. Let me know if there's any other debug
>>>>>>>> I can provide.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tim
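
For anyone wanting to reproduce the client-side symptom in problem (e) - a
CPG client that silently stops receiving events - the shape of such a client
is roughly the sketch below. This is a minimal illustration against the
public corosync CPG API; the group name and the bare-bones error handling
are placeholders, and it is not the actual daemon from this thread.

/*
 * Minimal CPG listener sketch: join a group and log membership changes.
 * Illustrative only - not the client daemon discussed in this thread.
 * Compile with: gcc cpg_listen.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static void deliver_fn (cpg_handle_t handle, const struct cpg_name *group,
	uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* Message delivery is ignored in this sketch. */
}

static void confchg_fn (cpg_handle_t handle, const struct cpg_name *group,
	const struct cpg_address *members, size_t member_entries,
	const struct cpg_address *left, size_t left_entries,
	const struct cpg_address *joined, size_t joined_entries)
{
	size_t i;

	for (i = 0; i < joined_entries; i++) {
		printf ("node %u joined\n", joined[i].nodeid);
	}
	for (i = 0; i < left_entries; i++) {
		printf ("node %u left (reason %u)\n",
			left[i].nodeid, left[i].reason);
	}
}

int main (void)
{
	cpg_handle_t handle;
	cpg_callbacks_t callbacks = {
		.cpg_deliver_fn = deliver_fn,
		.cpg_confchg_fn = confchg_fn,
	};
	struct cpg_name group;

	if (cpg_initialize (&handle, &callbacks) != CS_OK) {
		return 1;
	}

	strcpy (group.value, "test-group");	/* placeholder group name */
	group.length = strlen (group.value);
	if (cpg_join (handle, &group) != CS_OK) {
		return 1;
	}

	/*
	 * Block forever dispatching callbacks. If corosync stops delivering
	 * events (problem (e) above), this loop simply never fires the
	 * callbacks again, which looks identical to a quiet cluster.
	 */
	cpg_dispatch (handle, CS_DISPATCH_BLOCKING);

	cpg_finalize (handle);
	return 0;
}

A healthcheck like the one described earlier in the thread would watch that
this dispatch loop keeps making progress; without such an external liveness
signal, a wedged client is indistinguishable from an idle one.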
