[ https://issues.apache.org/jira/browse/SLING-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115218#comment-15115218 ]

Stefan Egli commented on SLING-5435:
------------------------------------

[~ianeboston], parts of this concern have already been worked on. During 
discussions around discovery.oak and discovery.etcd it became clear that both 
of these had the problem of propagating the 'leader change' information faster 
than changes propagate in the repository. That opened up the problem of 
getting notified about a leader change before the last changes of a perhaps 
crashed/shutdown instance have been seen by all other remaining instances 
(this is just one example; another is threading with the {{TopologyEvent}} 
itself).

This led to the conclusion that such 'fast leader detection mechanisms' 
require additional synchronization with the repository.

For the new discovery.oak this has been implemented as a separate SPI 
interface called {{ClusterSyncService}}, which can be enabled/disabled via 
configuration. So you can already run discovery.oak with a fast leader 
detector but without synchronization - except that the application then has 
to deal with the missing synchronization one way or another.

Sounds like what might be missing is some kind of generic support for the 
case where this synchronization is disabled in the discovery mechanism. 
Perhaps it would be useful to group the {{TopologyEventListeners}} into those 
that want synchronization and those that explicitly don't want it - e.g. 
along the lines of the sketch below?
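One purely hypothetical way to express that grouping (nothing like this 
exists in Sling today) would be a service registration property that the 
discovery implementation evaluates before delaying event delivery for 
synchronization:

{code:java}
// Hypothetical sketch: a listener opting out of repository synchronization
// via a made-up "discovery.sync" service property. The discovery
// implementation could deliver events to such listeners immediately and
// delay the others until the ClusterSyncService has finished.
import org.apache.sling.discovery.TopologyEvent;
import org.apache.sling.discovery.TopologyEventListener;
import org.osgi.service.component.annotations.Component;

@Component(
    service = TopologyEventListener.class,
    property = { "discovery.sync:Boolean=false" } // hypothetical marker
)
public class UnsynchronizedListener implements TopologyEventListener {

    @Override
    public void handleTopologyEvent(TopologyEvent event) {
        // Receives leader/topology changes as fast as the detector reports
        // them, accepting that repository changes may still be in flight.
    }
}
{code}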

> Decouple processes that depend on cluster leader elections from the cluster 
> leader elections.
> ---------------------------------------------------------------------------------------------
>
>                 Key: SLING-5435
>                 URL: https://issues.apache.org/jira/browse/SLING-5435
>             Project: Sling
>          Issue Type: Improvement
>          Components: General
>            Reporter: Ian Boston
>
> Currently there are many processes in Sling that must complete before a Sling 
> Discovery cluster leader election is declared complete. These processes 
> include things like transferring all Jobs from the old leader to the new 
> leader and waiting for the data to become visible on the new leader. This 
> adds overhead to the leader election process and forces a higher than 
> desirable timeout for elections and heartbeats. That higher timeout 
> precludes the use of more efficient election and distributed consensus 
> algorithms as implemented in etcd, ZooKeeper or implementations of Raft.
> If the election could be declared complete, leaving individual components to 
> manage their own post-election operations (i.e. decoupling those processes 
> from the election), then faster elections or alternative Discovery 
> implementations such as the one implemented on etcd could be used.
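As a minimal sketch of the per-component, decoupled post-election handling 
the issue describes (class and method names below are invented for 
illustration, only the {{TopologyEventListener}} API is real):

{code:java}
// Sketch (not existing code): the election is declared complete right away
// and the component runs its own post-election work, e.g. waiting until the
// previous leader's last writes are visible before taking over jobs.
import org.apache.sling.discovery.TopologyEvent;
import org.apache.sling.discovery.TopologyEventListener;

public class SelfSynchronizingJobTakeover implements TopologyEventListener {

    @Override
    public void handleTopologyEvent(TopologyEvent event) {
        if (event.getType() != TopologyEvent.Type.TOPOLOGY_CHANGED) {
            return;
        }
        if (event.getNewView().getLocalInstance().isLeader()) {
            // Run the handover asynchronously so the election itself does
            // not have to wait for it.
            new Thread(this::takeOverJobs, "job-takeover").start();
        }
    }

    private void takeOverJobs() {
        // Hypothetical post-election step: poll the repository until the old
        // leader's pending changes are visible, then resume job scheduling.
    }
}
{code}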


