[jira] [Resolved] (SLING-3750) Delay discovery-service readiness until first vote has finished, to avoid leader being overthrown

Stefan Egli (JIRA) Thu, 29 Jan 2015 05:52:28 -0800

     [ https://issues.apache.org/jira/browse/SLING-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stefan Egli resolved SLING-3750.
--------------------------------
    Resolution: Fixed

Implemented by introducing a new feature that avoids duplicate leaders on 
startup: the INIT event is now delayed (when configured to do so) until there 
is a valid, established cluster view. Previously, an INIT was sent immediately 
when a topology listener was bound or on activate. If the voting had not 
finished yet (no established view), all the instance could do was come up with 
an 'isolated view' containing only itself. And since every cluster view has a 
leader, it automatically became leader. The new approach is more stable as it 
waits for the first voting to conclude. That might delay startup, which is 
something to keep in mind. The new config is nevertheless enabled by default, 
as IMO this is how it should be.
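
For illustration, the delayed INIT described above could be modelled roughly 
as follows. This is only a minimal sketch and not the discovery.impl code: the 
class DelayedInitSketch, the View type, the delayInit flag and the method 
names are all hypothetical, and only the INIT delivery is modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified model of the delayed-INIT idea; not the discovery.impl source.
final class DelayedInitSketch {

    /** Minimal stand-in for a cluster view: its members and the elected leader. */
    static final class View {
        public final List<String> members;
        public final String leader;

        public View(List<String> members, String leader) {
            this.members = members;
            this.leader = leader;
        }
    }

    private final String localId;
    private final boolean delayInit; // assumed name for the new config switch
    private final List<Consumer<View>> waiting = new ArrayList<>();
    private View establishedView;    // stays null until the first vote has concluded

    DelayedInitSketch(String localId, boolean delayInit) {
        this.localId = localId;
        this.delayInit = delayInit;
    }

    /** Called when a listener is bound; decides when that listener gets its INIT view. */
    synchronized void bindListener(Consumer<View> listener) {
        if (establishedView != null) {
            // A vote has already concluded: INIT can be sent right away.
            listener.accept(establishedView);
        } else if (delayInit) {
            // New behaviour: hold the INIT back until an established view exists.
            waiting.add(listener);
        } else {
            // Old behaviour: isolated view containing only the local instance as leader.
            listener.accept(new View(Collections.singletonList(localId), localId));
        }
    }

    /** Called once the first vote has produced an established view. */
    synchronized void firstVoteFinished(View view) {
        establishedView = view;
        for (Consumer<View> listener : waiting) {
            listener.accept(view);
        }
        waiting.clear();
    }
}
```

With delayInit set to true, a listener bound before the first vote receives 
nothing until firstVoteFinished is called, and then sees the established view 
as its very first view. With delayInit set to false it immediately receives an 
isolated view naming the local instance as leader.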

> Delay discovery-service readiness until first vote has finished, to avoid 
> leader being overthrown
> -------------------------------------------------------------------------------------------------
>
>                 Key: SLING-3750
>                 URL: https://issues.apache.org/jira/browse/SLING-3750
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.0.8
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Critical
>             Fix For: Discovery Impl 1.0.14
>
>
> The current implementation of discovery.impl has a subtle problem at startup. 
> Consider the following scenario with two simultaneous starts:
>  * two (sling) instances start at roughly the same time
>  * the goal is to write a service which runs on one of the two only, ever
>  * to achieve that, a TopologyEventListener is used to get hold of the 
> latest TopologyView and derive whether the local instance is leader or not
>  * currently, upon registration of a TopologyEventListener, a TOPOLOGY_INIT 
> event is sent out immediately with the current TopologyView available
>  * right after startup though - hence before the first voting has passed - 
> discovery.impl considers itself to be in so-called "isolated" mode, creates a 
> topology which contains only itself, and makes itself leader (since every 
> cluster must have a leader)
>  * that means, both instances will receive that isolated view in the 
> TOPOLOGY_INIT and are marked as leader (which is kind of right as they don't 
> know about any other instance yet - but also wrong as it is not yet an 
> established view)
>  * at the same time, they both start voting, then find out about each other 
> and establish a view where one of the two is marked as leader - hence for the 
> other of the two a 'coup d'etat' is happening (the leader is overthrown even 
> though the instance did not crash). 
> This is certainly very problematic and should be avoided.
> The suggested way to avoid this is twofold: delay the time when the 
> discovery.impl service is registered with OSGi (by making it a @Component 
> only and registering it as a service explicitly after the first voting), and 
> delay the sending of TOPOLOGY_INIT until said first voting has finished.
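
The startup race described in the issue can be made concrete with a small 
model. This is a sketch under stated assumptions: the class StartupRaceSketch 
and the method leadersAtInit are hypothetical, and the "lowest id wins" 
election rule is only a placeholder for whatever the real voting decides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the startup race; none of these names exist in discovery.impl.
final class StartupRaceSketch {

    /**
     * Returns the ids of all instances whose first (INIT) view marks them as leader.
     *
     * @param instanceIds instances starting at roughly the same time
     * @param delayInit   true: INIT carries the established view after the first vote;
     *                    false: INIT carries an isolated view sent before the vote
     */
    static List<String> leadersAtInit(List<String> instanceIds, boolean delayInit) {
        // Assumption for illustration only: the vote elects the lowest id.
        String electedLeader = Collections.min(instanceIds);

        List<String> selfDeclaredLeaders = new ArrayList<>();
        for (String id : instanceIds) {
            // In an isolated view the instance only knows itself, so it is its own leader.
            String leaderInInitView = delayInit ? electedLeader : id;
            if (leaderInInitView.equals(id)) {
                selfDeclaredLeaders.add(id);
            }
        }
        return selfDeclaredLeaders;
    }
}
```

For two instances "instance-1" and "instance-2", leadersAtInit with delayInit 
set to false returns both ids (each one believes it is leader until the vote 
overthrows one of them), while delayInit set to true returns only 
"instance-1".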



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
