[ https://issues.apache.org/jira/browse/SLING-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906162#comment-14906162 ]
Stefan Egli commented on SLING-5030:
------------------------------------
Added a follow-up ticket, SLING-5058, that would go even further in trying to
avoid an unnecessary leader switch in the case where the local instance has not
even detected the topology change yet but was nevertheless isolated from the
point of view of the others. This is an extreme edge case, however, and thus
out of scope for 1.1.8.
> replace isolated mode with (larger) TOPOLOGY_CHANGING phase
> -----------------------------------------------------------
>
> Key: SLING-5030
> URL: https://issues.apache.org/jira/browse/SLING-5030
> Project: Sling
> Issue Type: Bug
> Components: Extensions
> Affects Versions: Discovery Impl 1.0.2
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Fix For: Discovery Impl 1.1.8
>
>
> As [described in
> SLING-3432|https://issues.apache.org/jira/browse/SLING-3432?focusedCommentId=14492494&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492494]
> one major reason why duplicate leaders happen in discovery.impl is the
> isolated mode: the rule of the discovery API is that every instance is always in
> a cluster. That kind of makes sense. However, when the connection to the
> cluster (i.e. to the repository) is faulty or delayed for some reason, and the
> rest of the cluster no longer considers the local instance to be alive
> (i.e. its heartbeats have timed out), then currently the local instance notices
> this 'isolated' state and wraps itself into a pseudo cluster consisting only
> of itself, of which it is by definition the leader.
> This is completely wrong: there should be no isolated mode. When the local
> instance is 'cut off' from the cluster, it should just immediately send out
> a TOPOLOGY_CHANGING event and remain in this state until things have settled with
> the repository and it has successfully taken part in a voting. Only then can
> it send out a TOPOLOGY_CHANGED event.
> This should fix a large number of situations where SLING-3432 has been seen.
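
The behaviour proposed in the description above could be sketched roughly as follows. This is only an illustration and not the discovery.impl code: the class ChangingPhaseTracker and its methods onCutOff() and onVotingSucceeded() are hypothetical names, and the sketch assumes the instance starts out in an established topology. Only the event names TOPOLOGY_CHANGING and TOPOLOGY_CHANGED are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual discovery.impl implementation.
class ChangingPhaseTracker {

    // The two event names mentioned in the issue description.
    enum Event { TOPOLOGY_CHANGING, TOPOLOGY_CHANGED }

    // Assumption: we start in an established (non-changing) topology.
    private boolean changing = false;

    // Events "sent out" so far, recorded here instead of notifying listeners.
    private final List<Event> sent = new ArrayList<>();

    // Called when the local instance notices it is cut off from the cluster.
    // No pseudo cluster is created; we only announce that things are changing.
    void onCutOff() {
        if (!changing) {
            changing = true;
            sent.add(Event.TOPOLOGY_CHANGING);
        }
    }

    // Called once the instance has successfully taken part in a voting.
    // Only then is TOPOLOGY_CHANGED sent.
    void onVotingSucceeded() {
        if (changing) {
            changing = false;
            sent.add(Event.TOPOLOGY_CHANGED);
        }
    }

    boolean isChanging() {
        return changing;
    }

    // Returns a copy so callers cannot modify the internal record.
    List<Event> sentEvents() {
        return new ArrayList<>(sent);
    }
}
```

In this sketch, repeated calls to onCutOff() while already in the changing phase send no further event, and the instance never declares itself leader of a one-member cluster along the way.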