[ 
https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907623#comment-14907623
 ] 

Chetan Mehrotra commented on OAK-3436:
--------------------------------------

Linking to OAK-2961 as similar warning about missing checkpoints were seen 
there also. At that time we concluded that missing checkpoint can happen (like 
case 1 here which eventually failed). The testcase can be checked to determine 
if in some case after missing checkpoint index still progresses or not

> Prevent missing checkpoint due to unstable topology from causing complete 
> reindexing
> ------------------------------------------------------------------------------------
>
>                 Key: OAK-3436
>                 URL: https://issues.apache.org/jira/browse/OAK-3436
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.7, 1.2.7, 1.0.22
>
>         Attachments: AsyncIndexUpdateClusterTest.java
>
>
> Async indexing logic relies on embedding application to ensure that async 
> indexing job is run as a singleton in a cluster. For Sling based apps it 
> depends on Sling Discovery support. At times it is being seen that if 
> topology is not stable then different cluster nodes can consider them as 
> leader and execute the async indexing job concurrently.
> This can cause problem as both cluster node might not see same repository 
> state (due to write skew and eventual consistency) and might remove the 
> checkpoint which other cluster node is still relying upon. For e.g. consider 
> a 2 node cluster N1 and N2 where both are performing async indexing.
> # Base state - CP1 is the checkpoint for "async" job
> # N2 starts indexing and removes changes CP1 to CP2. For Mongo the 
> checkpoints are saved in {{settings}} collection
> # N1 also decides to execute indexing but has yet not seen the latest 
> repository state so still thinks that CP1 is the base checkpoint and tries to 
> read it. However CP1 is already removed from {{settings}} and this makes N1 
> think that checkpoint is missing and it decides to reindex everything!
> To avoid this topology must be stable but at Oak level we should still handle 
> such a case and avoid doing a full reindexing. So we would need to have a 
> {{MissingCheckpointStrategy}} similar to {{MissingIndexEditorStrategy}} as 
> done in OAK-2203 
> Possible approaches
> # A1 - Fail the indexing run if checkpoint is missing - Checkpoint being 
> missing can have valid reason and invalid reason. Need to see what are valid 
> scenarios where a checkpoint can go missing
> # A2 - When a checkpoint is created also store the creation time. When a 
> checkpoint is found to be missing and its a *recent* checkpoint then fail the 
> run. For e.g. we would fail the run till checkpoint found to be missing is 
> less than an hour old (for just started take startup time into account)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to