[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)

Stefan Egli (Jira) Sun, 20 Jun 2021 08:08:08 -0700


    [ 
https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366195#comment-17366195
 ]


Stefan Egli commented on SLING-10489:
-------------------------------------

PRs updated with the following:
* 'joiner delay' introduced: when an instance joins a discovery cluster with 
existing members, it waits before sending the first TOPOLOGY_INIT - by default 
30 sec. This helps reduce a race condition with partial startup suppression: 
consider instances A and B up. Then B dies and simultaneously C starts up 
partially. In this case A will suppress C but at the same time notice that B 
has left, and thus make a topology change with B leaving, i.e. it will write 
the new sync token as per this new state. For C this means that as soon as it 
finishes the startup successfully, it would notice the sync token of A already 
written and it would immediately send a TOPOLOGY_INIT with (C and A). However, 
A might not be so fast: A still thinks the topology is just (A). Once it 
notices that C joined (a check that happens once every second) it will include 
C and declare a topology with (A and C). But there is a small time window in 
which different cluster instances could have different views of the topology. 
To avoid this, the joiner C applies an artificial delay (and stays without any 
topology in the meantime) to give A enough time to read C's sync token.
* the default timeout of the partial startup suppression was changed to 
infinity: there's no reason to stop suppressing a partial startup - if that 
instance doesn't write e.g. its syncToken, then it doesn't belong to the 
topology, no matter how long that takes. Having a timeout would be a 
compromise to eventually acknowledge that another instance is joining - but 
realistically that instance should be considered not part of the topology 
until it finishes the startup.
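
To illustrate the joiner delay described in the first bullet, here is a minimal sketch. It is not the code from the PRs: the class name JoinerDelaySketch, the method mayAnnounce and the way members are passed in are all hypothetical; only the idea (wait before the first TOPOLOGY_INIT when existing members are present, 30 sec by default) comes from the comment above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

/** Hypothetical sketch of a joiner delay; not the discovery.oak implementation. */
final class JoinerDelaySketch {
    private final Duration joinerDelay; // the comment above mentions a 30 sec default
    private Instant joinedAt;           // when existing members were first seen

    JoinerDelaySketch(Duration joinerDelay) {
        this.joinerDelay = joinerDelay;
    }

    /**
     * Returns true if the first TOPOLOGY_INIT may be sent at time 'now'.
     * 'otherMembers' holds the ids of instances already in the cluster view
     * (an assumed representation, for illustration only).
     */
    boolean mayAnnounce(Set<String> otherMembers, Instant now) {
        if (otherMembers.isEmpty()) {
            return true; // no existing members, so there is nobody to wait for
        }
        if (joinedAt == null) {
            joinedAt = now; // start the delay when existing members are first seen
        }
        // Stay without a topology until the delay has fully elapsed.
        return !now.isBefore(joinedAt.plus(joinerDelay));
    }
}
```

In the A/B/C scenario, C would call something like mayAnnounce(Set.of("A"), now) on each check and only announce (C and A) once it returns true, which gives A time to pick up C's sync token.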

> Ignore partially started, newly joining instances to avoid disturbing 
> discovery (for a while)
> ---------------------------------------------------------------------------------------------
>
>                 Key: SLING-10489
>                 URL: https://issues.apache.org/jira/browse/SLING-10489
>             Project: Sling
>          Issue Type: Improvement
>          Components: Discovery
>    Affects Versions: Discovery Oak 1.2.34
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>             Fix For: Discovery Oak 1.2.36
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Discovery.oak requires that both Oak and Sling are operating normally in 
> order to declare victory and announce a new topology.
> The startup phase is especially tricky in this regard, since there are 
> multiple elements that need to get updated (some are in the Oak layer, some 
> in Sling) :
>  * lease & clusterNodeId : this is maintained by Oak
>  * idMap : this is maintained by IdMapService (Sling)
>  * leaderElectionId : this is maintained by OakViewChecker (Sling)
>  * syncToken : this is maintained by SyncTokenService (Sling)
> Situations have been seen where Oak starts up fine, but higher level (e.g. 
> Sling) bundles were not activated within a reasonable amount of time. This 
> led to discovery staying in TOPOLOGY_CHANGING state for longer than expected.
> There should be a mechanism that ignores (suppresses) newly joining instances 
> if they start up only partially. However, after a certain timeout this 
> mechanism should give up.
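
The suppression rule from the description can be sketched as follows. This is an illustration under assumptions, not the actual discovery.oak code: PartialStartupSketch, JoinerState and suppress are hypothetical names, and treating a null timeout as "never give up" is just one way to model the infinite default mentioned in the comment.

```java
import java.time.Duration;

/** Hypothetical sketch of partial startup suppression; not the discovery.oak code. */
final class PartialStartupSketch {

    /** What a joining instance has written so far on the Sling side (assumed model). */
    static final class JoinerState {
        final boolean hasIdMapEntry;
        final boolean hasLeaderElectionId;
        final boolean hasSyncToken;

        JoinerState(boolean hasIdMapEntry, boolean hasLeaderElectionId, boolean hasSyncToken) {
            this.hasIdMapEntry = hasIdMapEntry;
            this.hasLeaderElectionId = hasLeaderElectionId;
            this.hasSyncToken = hasSyncToken;
        }

        /** Fully started only once every Sling-level element is present. */
        boolean fullyStarted() {
            return hasIdMapEntry && hasLeaderElectionId && hasSyncToken;
        }
    }

    /**
     * Decides whether a joiner should still be ignored.
     * A null timeout means "suppress indefinitely" (an assumption of this sketch).
     */
    static boolean suppress(JoinerState state, Duration sinceFirstSeen, Duration timeout) {
        if (state.fullyStarted()) {
            return false; // nothing partial about it, include it in the topology
        }
        if (timeout == null) {
            return true; // never give up on suppressing a partial startup
        }
        return sinceFirstSeen.compareTo(timeout) < 0; // give up once the timeout is reached
    }
}
```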



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
