[ https://issues.apache.org/jira/browse/SLING-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528138#comment-14528138 ]

Stefan Egli edited comment on SLING-4627 at 5/5/15 9:08 AM:
------------------------------------------------------------

I've added [^SLING-4627.patch] suggesting a
{{o.a.s.discovery.commons.providers.spi.ConsistencyService}} and its
integration into the {{ViewStateManager}}, and would appreciate
reviews/comments. Note that {{ConsistencyService}} is a plain interface at this
stage and is passed to the {{ViewStateManager}} in the constructor – it can
nevertheless be implemented as a service and would typically be injected by the
{{DiscoveryService}} (which in turn typically instantiates the
{{ViewStateManager}}).
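
For illustration, here's a minimal sketch of what such an interface could look
like (method name, signature and the {{BaseTopologyView}} import location are
assumptions for the sake of the example, not necessarily what the patch
defines):

{code:java}
package org.apache.sling.discovery.commons.providers.spi;

import org.apache.sling.discovery.commons.providers.BaseTopologyView;

/**
 * Sketch only: waits until the underlying (eventually consistent)
 * repository has caught up with a given local cluster change, then
 * invokes the provided callback.
 */
public interface ConsistencyService {

    /**
     * Invoked by the ViewStateManager before it sends TOPOLOGY_CHANGED;
     * the callback is called once consistency has been established.
     */
    void sync(BaseTopologyView view, Runnable callback);
}
{code}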

An implementation of {{ConsistencyService}} is yet to follow – this is the
intended algorithm:

 * rule #1: the algorithm is concerned with the effects of eventual consistency
(ie its delays) in the underlying, local repository versus a potentially
quicker discovery detection – it therefore applies only to the local cluster
and does not worry about changes outside of it
 * rule #2: [every TopologyEventListener must stop doing
leader/topology-dependent actions, especially towards the repository, as soon
as it receives
TOPOLOGY_CHANGING|https://issues.apache.org/jira/browse/SLING-3432?focusedCommentId=14492494&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492494]
 * rule #3: the {{ViewStateManager}} makes use of rule #2: on any
TOPOLOGY_CHANGING that affects the local cluster it writes a 'sync token' to a
well-known location in a conflict-free way (a sketch of such a token write
follows after this list): the sync token signals to anyone interested that the
corresponding instance *has sent* the corresponding topology event. When peers
are able to 'read' this 'sync token' they can safely assume they have read all
changes of that instance too (since the repository has to guarantee the
ordering of one instance's updates). The sync token has the following structure:
 ** path: {{/var/discovery/commons/syncTokens/<slingId>}}
 ** property: {{status}}
 ** value: {{changing=<localClusterSyncId>}} or {{changed=<localClusterSyncId>}}
 * rule #4: the {{localClusterSyncId}} is defined in
{{BaseTopologyView.getLocalClusterSyncId()}} and is thus implementation
specific – but it has to uniquely identify a particular incarnation of a local
cluster view change (see the javadoc for more details; in discovery.impl this
could have been the voting id)
 * rule #5: whenever an instance leaves the local cluster (detected using
SLING-4665) the {{ViewStateManager}} invokes this {{ConsistencyService}} to
wait for all sync tokens to become visible (either {{changing}} or
{{changed}}, with the current, expected {{localClusterSyncId}})
 * rule #6: there are three cases which require special handling – all of them
can be dealt with by waiting a minimal amount of time, the suggestion being to
use a {{minEventDelay}}. The actual handling of this, however, lives in the
{{ConsistencyService}} and is thus implementation specific (a wait sketch
covering rules #5 and #6 follows after this list). Here are the three cases:
 ** a crashed instance cannot have any write-backlog – it just crashed. Any
remaining half-finished work will be rolled back (in the Oak case). Yet the
last writes it made to the repository might not yet have been read by the
local instance – to account for this we need a {{minEventDelay}}
 ** if the instance is the only one left in the cluster, there is no sync token
to wait for – hence it would immediately declare itself synced. But in this
case too it needs to make sure the potential read-backlog is processed. For
this, no surprise, we also use the {{minEventDelay}}
 ** if the instance that left is still alive (eg in a SLING-3432-like
pseudo-partitioning case, which can happen with any discovery implementation),
that instance's repository writes would not be synced in any way with the
remaining instances (as it left the cluster view). Even though it would
eventually receive a TOPOLOGY_CHANGING and, according to rule #2, stop writing
critical data, in the meantime it could still do so. Thus here too a minimal
wait time is required for the partitioned-away instance to take note of the
change – yet again the {{minEventDelay}}
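
To make rule #3 more concrete, here's a minimal sketch of a conflict-free
token write based on the Sling {{ResourceResolver}} API (class and method
names are purely illustrative and not part of the patch):

{code:java}
import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceUtil;

/** Illustrative only – writes this instance's sync token (rule #3). */
public class SyncTokenWriter {

    static final String SYNCTOKEN_ROOT = "/var/discovery/commons/syncTokens";

    public void writeToken(ResourceResolver resolver, String slingId,
            String localClusterSyncId, boolean changing)
            throws PersistenceException {
        // conflict-free: each instance writes only below its own slingId
        Resource token = ResourceUtil.getOrCreateResource(resolver,
                SYNCTOKEN_ROOT + "/" + slingId, (String) null, null, false);
        ModifiableValueMap props = token.adaptTo(ModifiableValueMap.class);
        if (props == null) {
            throw new PersistenceException("sync token not modifiable");
        }
        props.put("status",
                (changing ? "changing=" : "changed=") + localClusterSyncId);
        resolver.commit();
    }
}
{code}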
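
And a matching sketch for rules #5 and #6 (again purely illustrative –
polling interval, timeout handling and names are assumptions): it polls until
every surviving peer's token carries the expected {{localClusterSyncId}}, and
even then still waits {{minEventDelay}} to cover the three special cases
above:

{code:java}
import java.util.Collection;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

/** Illustrative only – waits for the peers' sync tokens (rules #5/#6). */
public class SyncTokenWaiter {

    public boolean waitForTokens(ResourceResolver resolver,
            Collection<String> survivingSlingIds, String expectedSyncId,
            long timeoutMillis, long minEventDelayMillis)
            throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            resolver.refresh(); // pick up changes made by other instances
            boolean allSeen = true;
            for (String slingId : survivingSlingIds) {
                Resource token = resolver.getResource(
                        SyncTokenWriter.SYNCTOKEN_ROOT + "/" + slingId);
                String status = token == null ? null
                        : token.getValueMap().get("status", String.class);
                // rule #5: 'changing=<id>' and 'changed=<id>' both count
                if (status == null || !status.endsWith("=" + expectedSyncId)) {
                    allSeen = false;
                    break;
                }
            }
            if (allSeen) {
                // rule #6: even with all tokens visible (or none expected,
                // eg last instance standing), wait minEventDelay to cover
                // the crash, read-backlog and pseudo-partition cases
                Thread.sleep(minEventDelayMillis);
                return true;
            }
            Thread.sleep(250); // poll interval – an assumption
        }
        return false;
    }
}
{code}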

[~cziegeler], [~marett], wdyt?


> TOPOLOGY_CHANGED in an eventually consistent repository
> -------------------------------------------------------
>
>                 Key: SLING-4627
>                 URL: https://issues.apache.org/jira/browse/SLING-4627
>             Project: Sling
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Critical
>         Attachments: SLING-4627.patch, SLING-4627.patch
>
>
> This is a parent ticket describing the +coordination effort needed for
> properly sending TOPOLOGY_CHANGED when running on top of an eventually
> consistent repository+. These findings are independent of the implementation
> details used inside the discovery implementation, so they apply to
> discovery.impl, discovery.etcd/.zookeeper/.oak etc. Tickets to implement this
> for specific implementations are best created separately (eg as sub-tasks or
> related tickets). Also note that this assumes immediately sending
> TOPOLOGY_CHANGING as described [in
> SLING-3432|https://issues.apache.org/jira/browse/SLING-3432?focusedCommentId=14492494&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14492494]
> h5. The spectrum of possible TOPOLOGY_CHANGED events includes the following
> scenarios:
> || scenario || classification || action ||
> | A. change is completely outside of local cluster | (/) uncritical | changes 
> outside the cluster are considered uncritical for this exercise. |
> | B. a new instance joins the local cluster, this new instance is by contract 
> not the leader (leader must be stable \[0\]) | (/) uncritical | a join of an 
> instance is uncritical due to the fact that it merely joins the cluster and 
> has thus no 'backlog' of changes that might be propagating through the 
> (eventually consistent) repository. |
> | C. a non-leader *leaves* the local cluster | (x) *critical* | changes that
> were written by the leaving instance might not yet be *seen* by all
> surviving instances (ie discovery can be faster than the repository), and
> this must be assured before sending out TOPOLOGY_CHANGED. This is because the
> leaving instance could have written changes that are *topology dependent* and
> thus those changes must first be settled in the repository before continuing
> with a *new topology*. |
> | D. the leader *leaves* the local cluster (and thus a new leader is elected) 
> | (x)(x) *very critical* | same as C except that this is more critical due to 
> the fact that the leader left |
> | E. -the leader of the local cluster changes (without leaving)- this is not 
> supported by contract (leader must be stable \[0\]) | (/) -irrelevant- | |
> So both C and D are about an instance leaving. And as mentioned above, the
> survivors must assure they have read all changes of the leaver. There are
> two parts to this:
> * the leaver could have pending writes that are not yet in MongoDB: I don't
> think this is the case. The only thing that can remain is an
> uncommitted branch, and that would be rolled back afaik.
> ** The exception to this is a partition, where the leaver didn't actually
> crash but is still hooked to the repository. *For this I'm not sure how it
> can be solved* yet.
> * the survivors could, however, not yet have read all changes (pending in the
> background read), and one way to make sure they did is to have each surviving
> instance write a (pseudo-) sync token to the repository. Once all survivors
> have seen this sync token of all other survivors, the assumption is that all
> pending changes are "flushed" through the eventually consistent repository
> and that it is safe to send out a TOPOLOGY_CHANGED event.
> * this sync token must be *conflict free* and could be eg: 
> {{/var/discovery/oak/clusterInstances/<slingId>/syncTokens/<newViewId>}} - 
> where {{newViewId}} is defined by whatever discovery mechanism is used
> * a special case is when only one instance remains. It then cannot wait
> for any other survivor to write a sync token, so in that case sync tokens
> would not work. All it can then do is wait for a certain time (which
> should be larger than any expected background-read duration)
> [~mreutegg], [~chetanm] can you pls confirm/comment on the above "flush/sync 
> token" approach? Thx!
> /cc [~marett]
> \[0\] - see [getLeader() in 
> ClusterView|https://github.com/apache/sling/blob/trunk/bundles/extensions/discovery/api/src/main/java/org/apache/sling/discovery/ClusterView.java]


