[
https://issues.apache.org/jira/browse/OAK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933371#comment-16933371
]
Marcel Reutegger commented on OAK-8627:
---------------------------------------
Remaining changes to discuss: [^OAK-8627.patch].
> Avoid late-arriving lastRev update from crashed instance
> --------------------------------------------------------
>
> Key: OAK-8627
> URL: https://issues.apache.org/jira/browse/OAK-8627
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: documentmk
> Affects Versions: 1.16.0
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Priority: Major
> Attachments: OAK-8627.patch
>
>
> Recently, in a deployment with a two-node cluster, Sling Discovery Oak showed
> a cluster view in which one clusterId was stuck in the deactivating state.
> According to the entry in the clusterNodes collection, the clusterId in the
> deactivating state was inactive. However, the revisions for the _lastRev
> entry on the root document and the lastWrittenRootRev did not match. The
> latter was slightly more recent. This caused the Sling Discovery Oak to
> consider the clusterId as not entirely shut down.
> While there is no direct proof, [~mreutegg] identified one theoretical
> scenario as a _potential_ root cause: the lastRev for a clusterId on the
> root document can be set back to an earlier value due to a race condition:
> Before the lease expiry, the background update thread could have issued an
> update for the root document, which then took a very long time to reach the
> DocumentStore: longer than the lease timeout plus the recovery, which must
> have been done by another instance in the meantime.
> If such a late-arriving update of the {{_lastRev}} is possible, then the
> reset of the lastRev value on the root document could be explained, since the
> update is currently done unconditionally.
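
The suspected race can be illustrated with a minimal, self-contained sketch. It is not Oak code and does not use the DocumentStore API: the class LastRevSketch, its methods, and the use of a plain long as a revision are all assumptions made for illustration. It only contrasts an unconditional write of the lastRev with one that keeps the newer of the stored and the incoming value.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical in-memory stand-in for the _lastRev map on the root document.
 * Revisions are simplified to long timestamps; this is not Oak's API.
 */
class LastRevSketch {
    // clusterId -> last revision recorded for that clusterId
    private final ConcurrentMap<Integer, Long> lastRev = new ConcurrentHashMap<>();

    /** Unconditional write: whatever arrives last wins, even if it is older. */
    void setUnconditionally(int clusterId, long rev) {
        lastRev.put(clusterId, rev);
    }

    /** Conditional write: atomically keep the newer of stored and incoming value. */
    void setIfNewer(int clusterId, long rev) {
        lastRev.merge(clusterId, rev, Long::max);
    }

    long get(int clusterId) {
        return lastRev.get(clusterId);
    }

    public static void main(String[] args) {
        long delayedUpdate = 100L;  // issued before lease expiry, arrives late
        long recoveredRev = 120L;   // written by the recovering instance

        LastRevSketch unconditional = new LastRevSketch();
        unconditional.setUnconditionally(1, recoveredRev);
        unconditional.setUnconditionally(1, delayedUpdate); // late arrival
        System.out.println(unconditional.get(1)); // prints 100: lastRev went backwards

        LastRevSketch conditional = new LastRevSketch();
        conditional.setIfNewer(1, recoveredRev);
        conditional.setIfNewer(1, delayedUpdate); // late arrival is ignored
        System.out.println(conditional.get(1)); // prints 120
    }
}
```

Whether a "keep the newer value" condition is the right fix for Oak is a matter for the attached patch; the sketch only shows why an unconditional update allows the reset described above.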
--
This message was sent by Atlassian Jira
(v8.3.4#803005)