[ 
https://issues.apache.org/jira/browse/OAK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Reschke updated OAK-8627:
--------------------------------
    Description: 
Recently a deployment with a two node cluster showed a Sling Discovery Oak with 
a cluster view that had a clusterId stuck in the deactivating state.

According to the entry in the clusterNodes collection, the clusterId in the 
deactivating state was inactive. However, the revisions for the _lastRev entry 
on the root document and the lastWrittenRootRev did not match. The latter was 
slightly more recent. This caused the Sling Discovery Oak to consider the 
clusterId as not entirely shut down.

While there is no direct proof, one theoretical scenario [~mreutegg] identified 
as a _potential_ root cause was that it can happen that the lastRev for a 
clusterId on the root document is set back to an earlier value due to a race 
condition:

Before the lease expiry, the background update thread could have issued an 
update for the root document, which then took a very long time to reach the 
DocumentStore, longer than the lease timeout and recovery which must have been 
done by another instance meanwhile.

If such a late-arriving update of the {{_lastRev}} is possible, then the reset 
of the lastRev value on the root document could be explained, since the update 
is currently done unconditionally.

  was:
Recently a deployment with a two node cluster showed a Sling Discovery Oak with 
a cluster view that had a clusterId stuck in the deactivating state.

According to the entry in the clusterNodes collection, the clusterId in the 
deactivating state was inactive. However, the revisions for the _lastRev entry 
on the root document and the lastWrittenRootRev did not match. The latter was 
slightly more recent. This caused the Sling Discovery Oak to consider the 
clusterId as not entirely shut down.

While there is no direct proof, one theoretical scenario [~mreutegg] identified 
as a _potential_ root cause was that it can happen that the lastRev for a 
clusterId on the root document is set back to an earlier value due to a race 
condition:

Before the lease expiry, the backgorund update thread could have issued an 
update for the root document, which then took a very long time to reach the 
DocumentStore, longer than the lease timeout and recovery which must have been 
done by another instance meanwhile.

If such a late-arriving update of the {{_lastRev}} is possible, then the reset 
of the lastRev value on the root document could be explained, since the update 
is currently done unconditionally.


> Avoid late-arriving lastRev update from crashed instance
> --------------------------------------------------------
>
>                 Key: OAK-8627
>                 URL: https://issues.apache.org/jira/browse/OAK-8627
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: documentmk
>    Affects Versions: 1.16.0
>            Reporter: Stefan Egli
>            Assignee: Marcel Reutegger
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: OAK-8627-2.patch, OAK-8627.patch
>
>
> Recently a deployment with a two node cluster showed a Sling Discovery Oak 
> with a cluster view that had a clusterId stuck in the deactivating state.
> According to the entry in the clusterNodes collection, the clusterId in the 
> deactivating state was inactive. However, the revisions for the _lastRev 
> entry on the root document and the lastWrittenRootRev did not match. The 
> latter was slightly more recent. This caused the Sling Discovery Oak to 
> consider the clusterId as not entirely shut down.
> While there is no direct proof, one theoretical scenario [~mreutegg] 
> identified as a _potential_ root cause was that it can happen that the 
> lastRev for a clusterId on the root document is set back to an earlier value 
> due to a race condition:
> Before the lease expiry, the background update thread could have issued an 
> update for the root document, which then took a very long time to reach the 
> DocumentStore, longer than the lease timeout and recovery which must have 
> been done by another instance meanwhile.
> If such a late-arriving update of the {{_lastRev}} is possible, then the 
> reset of the lastRev value on the root document could be explained, since the 
> update is currently done unconditionally.
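
The race described above can be illustrated with a minimal Java sketch. It is not Oak's DocumentStore API and not necessarily what the attached patches do: the class LastRevSketch, its methods, and the use of a plain long as a stand-in for a revision are all assumptions made for illustration. It contrasts an unconditional write of the lastRev with one possible guard that ignores any update older than the stored value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Oak.
class LastRevSketch {

    // clusterId -> stand-in for the _lastRev stored on the root document.
    // Assumption: a revision is reduced to a single long that grows over time.
    private final Map<Integer, Long> lastRevs = new HashMap<>();

    // Unconditional update: whatever arrives last wins, even if it is older.
    synchronized void setUnconditionally(int clusterId, long revision) {
        lastRevs.put(clusterId, revision);
    }

    // Conditional update: apply only if the proposed revision is newer
    // than the stored one. Returns true if the stored value changed.
    synchronized boolean setIfNewer(int clusterId, long revision) {
        Long current = lastRevs.get(clusterId);
        if (current != null && current >= revision) {
            return false; // stale or duplicate update, ignored
        }
        lastRevs.put(clusterId, revision);
        return true;
    }

    synchronized Long get(int clusterId) {
        return lastRevs.get(clusterId);
    }

    public static void main(String[] args) {
        int crashedClusterId = 1;

        // Recovery by another instance has already written revision 200,
        // then the delayed background update with revision 100 arrives.
        LastRevSketch unconditional = new LastRevSketch();
        unconditional.setUnconditionally(crashedClusterId, 200L);
        unconditional.setUnconditionally(crashedClusterId, 100L);
        System.out.println("unconditional: " + unconditional.get(crashedClusterId));

        LastRevSketch conditional = new LastRevSketch();
        conditional.setIfNewer(crashedClusterId, 200L);
        boolean applied = conditional.setIfNewer(crashedClusterId, 100L);
        System.out.println("conditional: " + conditional.get(crashedClusterId)
                + " (late update applied: " + applied + ")");
    }
}
```

In the unconditional case the late write sets the stored value back to the older revision, which matches the mismatch with lastWrittenRootRev described in this issue. With the guard, the late write is dropped and the value written during recovery is kept.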


