[https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920687#comment-16920687]

Anton Vinogradov commented on IGNITE-12133:
-------------------------------------------

[~mnk], 

1) With the skip-list approach we would lose the discovery message 
processing order guarantee.

P.S. Not sure we need that guarantee in some cases :) just an addition.


2) BTW, we already have a discovery implementation based on ZooKeeper 
(see [https://apacheignite.readme.io/docs/zookeeper-discovery]).

Anyway, it seems this optimization would speed up the ring rather than 
replace it with a star.

 

Could you start a dev list discussion before the implementation?

> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communication is O(n) in the number of nodes. It also means that if 
> non-coordinator nodes hang, it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first? The coordinator can notify the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node A if 
> max-level(B) <= max-level(A) and B is the first node at its level when 
> traversing from A in the direction of B, skipping over nodes C which have 
> max-level(C) > max-level(A).
> 1
> 1 . . . 3
> 1       3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above, 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, 
> 4 -> 6, and 5 -> 6 (see the sketch after this quoted description).
> One can achieve better redundancy by having each node traverse in both 
> directions, and by having the coordinator also notify the last node in the 
> list at each level. This way, in the above example, if 2 and 3 were both 
> down, 4 would still get notified by 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify, so the 
> overall time is reduced. Additionally, we can deal well with at least one 
> node failure, and if one includes the option of processing backwards, two 
> consecutive node failures can be handled as well. With this kind of 
> approach, the coordinator can basically treat any nodes it didn't receive a 
> message from as not connected and update the topology accordingly 
> (disconnecting any nodes that it didn't get a notification from). While 
> there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected 
> node, then 2 disconnected nodes: the connected node would be wrongly 
> ejected from the topology), these would generally be too rare to need 
> explicit handling.
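
For illustration only (this is not part of the ticket text, and nothing below 
is existing Ignite API): a minimal, standalone Java sketch of one reading of 
the near-neighbour rule quoted above that reproduces the worked example 
(1 -> {2, 3}, 2 -> {4}, 3 -> {4, 5}, 4 -> {6}, 5 -> {6}). All class, method 
and variable names are hypothetical, and the max levels are taken from the 
diagram in the description.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SkipListFanOutSketch {
    /**
     * Collects the neighbours that the node at index {@code from} would
     * notify when walking in one direction: for every level up to its own
     * max level, take the first node whose max level reaches that level,
     * skipping nodes whose max level exceeds the notifier's (those are
     * reached from a higher level instead).
     */
    static Set<Integer> nearNeighbours(int[] maxLevel, int from, int dir) {
        Set<Integer> result = new TreeSet<>();
        int own = maxLevel[from];

        for (int lvl = 1; lvl <= own; lvl++) {
            for (int i = from + dir; i >= 0 && i < maxLevel.length; i += dir) {
                if (maxLevel[i] > own)
                    continue; // Taller nodes are notified at their own level; skip them.

                if (maxLevel[i] >= lvl) {
                    result.add(i);
                    break; // First node reaching this level in this direction.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Max levels from the diagram: node 1 spans 4 levels, node 3 spans 3,
        // node 5 spans 2, and nodes 2, 4 and 6 sit only on the bottom level.
        int[] maxLevel = {4, 1, 3, 1, 2, 1};

        for (int n = 0; n < maxLevel.length; n++) {
            List<Integer> names = new ArrayList<>();
            for (int i : nearNeighbours(maxLevel, n, +1))
                names.add(i + 1); // 0-based index -> 1-based node name.

            System.out.println("node " + (n + 1) + " notifies " + names);
        }
        // Expected: node 1 notifies [2, 3], node 2 notifies [4],
        // node 3 notifies [4, 5], node 4 notifies [6], node 5 notifies [6],
        // node 6 notifies [].
    }
}
{code}

The main method only walks forward; calling nearNeighbours with dir = -1 
would give the backwards pass the description suggests for redundancy, and 
the coordinator's extra notification of the last node at each level would 
seed that pass.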



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
