On 08/03/2011 10:32 PM, Tim Beale wrote:
> Hi,
> 
> It looks to me that the way the transition from Recovery to Operational works,
> we can't guarantee that all nodes in the ring have entered Operational before
> a node processes another Memb-Join message from a new node. E.g. we can't
> guarantee the token has rotated right the way around the ring.
> 
> When this happens, the nodes still in Recovery will still use the older ring
> ID. So they won't get added to the transitional membership, and CLM will report
> leave events for these nodes. (Plus there might be other side-effects, like the
> FAILED TO RECEIVE problem - I haven't quite worked out why that's happening).
> 
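The ring-ID mismatch described above can be illustrated with a minimal sketch. This is not corosync code: the `ring_id` and `member` structs and the `transitional_members()` function are hypothetical, simplified stand-ins. The sketch assumes the transitional membership keeps only those nodes whose previous ring ID equals the local node's previous ring ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ring identifier (representative + sequence). */
struct ring_id {
    uint32_t rep;
    uint64_t seq;
};

/* Hypothetical view of a joining node: its id and the ring it last reported. */
struct member {
    uint32_t nodeid;
    struct ring_id old_ring;
};

static bool ring_id_equal(struct ring_id a, struct ring_id b)
{
    return a.rep == b.rep && a.seq == b.seq;
}

/* Copy into `out` the ids of nodes whose previous ring matches ours.
 * A node still in Recovery reports an older ring id and is skipped,
 * which is the situation that would show up as a leave event.
 * Returns the number of ids written to `out`. */
size_t transitional_members(const struct member *joined, size_t n,
                            struct ring_id my_old_ring, uint32_t *out)
{
    size_t kept = 0;

    assert(joined != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (ring_id_equal(joined[i].old_ring, my_old_ring)) {
            out[kept++] = joined[i].nodeid;
        }
    }
    return kept;
}
```

If two nodes report ring seq 8 and a third, still in Recovery, reports seq 4, only the first two end up in the transitional list, even though all three are healthy.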

Thanks for the pointer here - patch on ml.

> We are currently using CLM to check the health of a node, i.e. so we can detect
> if it locks up. My questions are:
> i) Are there config settings we could change to improve this, like increasing
> the 'join' timeout?
> ii) Should I try to make a code change to fix the problem? E.g. delay
> processing the Memb-Join message if the node's only just entered operational.
> iii) Should we not be using CLM like this? I.e. should we just learn to live
> with CLM/CPG sometimes reporting nodes as leaving when they're perfectly
> healthy?
> 
> Thanks for your help.
> Tim
> 

Tim, please try the patch I have recently posted:
[PATCH] Set my_new_memb_list in recovery enter

First and foremost, let me know if it resolves your 10-node startup case
which fails 10% of the time.  Then let me know if it treats other symptoms.

Regards
-steve


> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <[email protected]> wrote:
>> Hi,
>>
>> We're booting up a 10-node cluster (with all nodes starting corosync at roughly
>> the same time) and approx 1 in 10 times we see some problems:
>> a) CLM is reporting nodes as leaving and then immediately rejoining (not sure
>> if this is valid behaviour?)
>> b) Probably an unrelated oddity, but we're getting flow control enabled on a
>> client daemon using CLM that's only sending one request (saClmClusterTrack()).
>> c) A node is hitting the FAILED TO RECEIVE case
>> d) After c) there seems to be a lot of churn as the cluster tries to reform
>> e) During the processing of node leave events, the CPG client can sometimes get
>> broken so it no longer processes *any* CPG events
>>
>> Corosync debug is attached (I commented out some of the noisier debug around
>> message delivery). We don't really know enough about corosync to tell what
>> exactly is incorrect behaviour and what should be fixed. But here's what we've
>> noticed:
>> 1) Node-4 joins soon after node-1. When this happens all nodes except node-12
>> have entered operational state (see node-12.txt line 235). It looks like maybe
>> node-12 hasn't received enough rotations of the token to enter operational yet.
>> Node-12's resulting transitional config consists of just itself. All nodes then
>> report node-1 and node-12 as leaving and immediately rejoining.
>> 2) After this config change, node-3 eventually hits the FAILED TO RECEIVE case
>> (node-3.txt line 380). At this point node-1 and node-12 have an ARU matching
>> the high_seq_received, all other nodes have an ARU of zero.
>> 3) Node-3 entering gather seems to result in a lot of config change churn
>> across the cluster.
>> 4) While processing the config changes on node-3, the CPG downlist it uses
>> contains itself. When node-3 sends leave events for the nodes in the downlist
>> (including itself), it sets its own cpd state to CPD_STATE_UNJOINED and clears
>> the cpd->group_name. This means it no longer sends any CPG events to the CPG
>> client.
>>
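One way to picture the downlist problem in point 4 is a filter that drops the local node before any leave events are generated. The sketch below is an illustration only: `downlist_without_self()` is a hypothetical helper, not a corosync function and not the interim change mentioned in this thread, and it assumes node ids are plain 32-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of corosync): copy every node id from
 * `downlist` into `out` except `local_nodeid`, preserving order.
 * `out` must have room for at least `n` entries.
 * Returns the number of ids written to `out`. */
size_t downlist_without_self(const uint32_t *downlist, size_t n,
                             uint32_t local_nodeid, uint32_t *out)
{
    size_t kept = 0;

    assert(downlist != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (downlist[i] != local_nodeid) {
            out[kept++] = downlist[i];
        }
    }
    return kept;
}
```

Sending leave events only for the filtered list would mean a node never marks its own client as unjoined, so it would keep delivering CPG events to that client.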
>> We tried cherry-picking this commit to fix the problem (#4) with the CPG client.
>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>> It helped a bit, but didn't fix it completely. We've made an interim change
>> (attached) to avoid this problem.
>>
>> We're using corosync v1.3.1 on an embedded linux system (with a low-spec CPU).
>> Corosync is running over a basic ethernet interface (no hubs/routers/etc).
>>
>> Any help would be appreciated. Let me know if there's any other debug I can
>> provide.
>>
>> Thanks,
>> Tim
>>
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
