[ 
https://issues.apache.org/jira/browse/OAK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930518#comment-16930518
 ] 

Stefan Egli commented on OAK-8627:
----------------------------------

Pushed changes in [this 
branch|https://github.com/stefan-egli/jackrabbit-oak/tree/OAK-8627] :
* [this 1st 
commit|https://github.com/stefan-egli/jackrabbit-oak/commit/f9c5ece53099714ee584986e75c4c7e8470409ac]
 is originally from [~mreutegg]
* [this 2nd 
commit|https://github.com/stefan-egli/jackrabbit-oak/commit/7e8ac4c0bf76db824a051c304a03ce02b93a40a1]
 fixes the test failures: some tests were doing a {{recover}} and 
thereafter a {{dispose}} in tearDown, which failed because the last 
backgroundWrite did not succeed (due to the recover). This is now an expected 
situation, hence I've adjusted the affected test cases to accept it. Another 
case was a concurrent recover, which can now fail due to the new check; that 
is now also handled as acceptable.

Other than that, the one remaining failure scenario I see is that the root 
update succeeds but lastWrittenRootRev is not updated (e.g. due to a crash). 
This is already handled correctly, though: the 
DocumentDiscoveryLiteService has a [{{lastKnownRevision >= 
lastWrittenRootRev}}|https://github.com/apache/jackrabbit-oak/blob/637a58a1c185033c6bfc2ca8259662fc1ad68255/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/DocumentDiscoveryLiteService.java#L533]
 check, so it handles this fine.
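
To illustrate why that scenario is harmless, here is a minimal sketch of the comparison. It is not the actual DocumentDiscoveryLiteService code: the class {{BacklogCheckSketch}}, the method {{hasNoBacklog}} and the modelling of revisions as plain {{long}} timestamps are assumptions made for illustration only.

```java
// Hypothetical sketch, NOT the real DocumentDiscoveryLiteService code.
// Assumption: revisions are reduced to plain long timestamps.
final class BacklogCheckSketch {

    /**
     * Returns true when the observer has seen at least everything the
     * inactive instance recorded as written to the root document.
     */
    static boolean hasNoBacklog(long lastKnownRevision, long lastWrittenRootRev) {
        return lastKnownRevision >= lastWrittenRootRev;
    }

    public static void main(String[] args) {
        // Crash after the root update but before lastWrittenRootRev was
        // stored: the root is ahead of lastWrittenRootRev, the check passes.
        System.out.println(hasNoBacklog(120L, 100L));

        // Situation from the issue description: the root _lastRev was set
        // back, so it is behind lastWrittenRootRev and the check fails.
        System.out.println(hasNoBacklog(90L, 100L));
    }
}
```

Under these assumptions only a root revision that lags behind lastWrittenRootRev keeps the clusterId in the deactivating state; a root revision that is ahead of it does not.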

I'd suggest applying this change. [~mreutegg], as you originally started 
working on this, can you please review and approve the change? thx!

> Avoid late-arriving lastRev update from crashed instance
> --------------------------------------------------------
>
>                 Key: OAK-8627
>                 URL: https://issues.apache.org/jira/browse/OAK-8627
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: documentmk
>    Affects Versions: 1.16.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>
> Recently a deployment with a two node cluster showed a Sling Discovery Oak 
> with a cluster view that had a clusterId stuck in the deactivating state.
> According to the entry in the clusterNodes collection, the clusterId in the 
> deactivating state was inactive. However, the revisions for the _lastRev 
> entry on the root document and the lastWrittenRootRev did not match. The 
> latter was slightly more recent. This caused the Sling Discovery Oak to 
> consider the clusterId as not entirely shut down.
> While there is no direct proof, one theoretical scenario [~mreutegg] 
> identified as a _potential_ root cause was that it can happen that the 
> lastRev for a clusterId on the root document is set back to an earlier value 
> due to a race condition:
> Before the lease expiry, the background update thread could have issued an 
> update for the root document, which then took a very long time to reach the 
> DocumentStore, longer than the lease timeout and recovery which must have 
> been done by another instance meanwhile.
> If such a late-arriving update of the {{_lastRev}} is possible, then the 
> reset of the lastRev value on the root document could be explained, since the 
> update is currently done unconditionally.
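
The race described above, and how a conditional update would avoid it, can be sketched roughly as follows. This is not the change in the branch and not the DocumentStore API: the class {{GuardedLastRevSketch}}, its methods and the "epoch" counter that recovery increments are all hypothetical names, chosen only to show the idea of rejecting a write that was issued before recovery ran.

```java
// Hypothetical sketch of a guarded _lastRev update; not Oak code.
// Assumption: recovery bumps an "epoch" that every writer must present.
final class GuardedLastRevSketch {

    private long lastRev; // stands in for one clusterId's _lastRev on the root
    private long epoch;   // incremented each time recovery runs

    synchronized long currentEpoch() {
        return epoch;
    }

    synchronized long lastRev() {
        return lastRev;
    }

    /** Applies the update only if no recovery happened since the writer read the epoch. */
    synchronized boolean update(long newRev, long epochSeenByWriter) {
        if (epochSeenByWriter != epoch) {
            return false; // late-arriving write from before recovery: reject
        }
        lastRev = newRev;
        return true;
    }

    /** Recovery by another instance: set the recovered value and invalidate old writers. */
    synchronized void recover(long recoveredRev) {
        epoch++;
        lastRev = recoveredRev;
    }

    public static void main(String[] args) {
        GuardedLastRevSketch root = new GuardedLastRevSketch();
        long seen = root.currentEpoch();
        root.update(100L, seen);                          // normal background update
        root.recover(110L);                               // lease expired, recovery ran
        boolean lateApplied = root.update(105L, seen);    // delayed update arrives
        System.out.println(lateApplied + " " + root.lastRev());
    }
}
```

With an unconditional update, the delayed write of 105 would overwrite the recovered value 110, which is the reset the description hypothesizes.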



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
