[ 
https://issues.apache.org/jira/browse/OAK-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644567#comment-14644567
 ] 

Stefan Egli commented on OAK-2739:
----------------------------------

Two possible strategies could be followed up here:

h3. Detect and React
* Detection: the {{BackgroundLeaseUpdate}} thread finds, in the middle of 
operation (i.e. not at startup), that the lease no longer exists although it 
expected it to, having created it itself just one interval earlier. So 
detecting this situation is easy.
* Reaction: reacting, however, is difficult and likely impossible. In theory 
an arbitrarily long time can pass between the lease timeout and the moment 
it is detected. During this time nothing prevents the {{DocumentNodeStore}} 
from writing new data to the documents. For most of the data written in this 
phase that is not much of a problem. But for data that is _topology 
dependent_ (e.g. dependent on the instance being the leader) it can result in 
_duplicate leader situations_ which could no longer be resolved after the 
fact.
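
To illustrate why detection is the easy half, here is a rough sketch. It is not 
Oak code: {{LeaseLossDetector}}, {{LeaseStore}} and the way the lease end is 
stored are all hypothetical names and assumptions made up for this example. 
The renewing thread remembers the lease end it last wrote and compares it with 
what is persisted on the next run.

```java
import java.util.OptionalLong;

/** Hypothetical sketch, not an Oak class: detects a lost lease at renewal time. */
final class LeaseLossDetector {

    /** Hypothetical stand-in for wherever the lease end time is persisted. */
    interface LeaseStore {
        OptionalLong readLeaseEnd();
        void writeLeaseEnd(long leaseEndMillis);
    }

    private final LeaseStore store;
    private final long leaseMillis;
    private long lastWritten = -1; // -1 means: nothing written yet (startup)

    LeaseLossDetector(LeaseStore store, long leaseMillis) {
        this.store = store;
        this.leaseMillis = leaseMillis;
    }

    /**
     * Renews the lease. Returns false if, in the middle of operation, the
     * lease written one interval ago is missing, was changed by someone
     * else, or has already timed out.
     */
    boolean renew(long nowMillis) {
        if (lastWritten >= 0) { // skip the check at startup
            OptionalLong persisted = store.readLeaseEnd();
            boolean lost = !persisted.isPresent()
                    || persisted.getAsLong() != lastWritten
                    || nowMillis >= lastWritten;
            if (lost) {
                return false;
            }
        }
        lastWritten = nowMillis + leaseMillis;
        store.writeLeaseEnd(lastWritten);
        return true;
    }
}
```

Note that a {{false}} return only tells us the lease was lost at some point 
since the previous renewal. It says nothing about what was written in the 
meantime, which is exactly the reaction problem described above.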

h3. Prevent
An alternative approach would be to prevent such a situation entirely. An 
instance would only ever modify the {{DocumentStore}} when its lease is still 
valid. 
* Now this cannot be made dependent on the persisted lease state, as the 
lease update thread could again be blocked or otherwise prevented from 
updating the lease. 
* But perhaps a more robust and simpler approach would be to start an internal 
countdown on every lease renewal and *allow modifying requests to the 
DocumentStore only while this countdown has not yet hit zero*. The countdown 
could run for e.g. half of the lease time, or for any duration that leaves a 
reasonable margin relative to the lease update and lease timeout values.
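
A minimal sketch of such a countdown, to make the idea concrete. {{LeaseGuard}} 
and its methods are hypothetical names, not existing Oak API, and the 
half-lease margin is just the example value mentioned above. The clock is 
injected so the behaviour can be tested without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch, not an Oak class: local countdown guarding writes. */
final class LeaseGuard {

    private final long marginNanos;
    private final LongSupplier nanoClock;
    // volatile: written by the lease update thread, read by writer threads
    private volatile long deadlineNanos;

    /** Assumption: writes are allowed for half the lease time after a renewal. */
    LeaseGuard(long leaseMillis, LongSupplier nanoClock) {
        this.marginNanos = TimeUnit.MILLISECONDS.toNanos(leaseMillis) / 2;
        this.nanoClock = nanoClock;
        // Countdown starts at zero: no writes before the first renewal.
        this.deadlineNanos = nanoClock.getAsLong();
    }

    /** To be called by the lease update thread after each successful renewal. */
    void renewed() {
        deadlineNanos = nanoClock.getAsLong() + marginNanos;
    }

    /** To be called before each modifying request; throws once the countdown hit zero. */
    void checkBeforeWrite() {
        // Subtraction keeps the comparison safe if the nano clock wraps around.
        if (nanoClock.getAsLong() - deadlineNanos >= 0) {
            throw new IllegalStateException("lease countdown expired, refusing to write");
        }
    }
}
```

In production this would be wired up as {{new LeaseGuard(leaseMillis, System::nanoTime)}}. 
The check is one volatile read and one clock read, with no locking and no 
access to persisted state, so it does not depend on the lease update thread 
making progress.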

IMO we should go the 'prevent' way, with an explicit lease check before each 
document modification. (This check would therefore have to be implemented 
very efficiently, but that should be a no-brainer.) 

I'll follow up on this idea and will come up with a patch.

/cc [~mreutegg], [~chetanm], [~reschke], wdyt? 

> take appropriate action when lease cannot be renewed (in time)
> --------------------------------------------------------------
>
>                 Key: OAK-2739
>                 URL: https://issues.apache.org/jira/browse/OAK-2739
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: mongomk
>    Affects Versions: 1.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>              Labels: resilience
>             Fix For: 1.3.5
>
>
> Currently, in an oak-cluster when (e.g.) one oak-client stops renewing its 
> lease (ClusterNodeInfo.renewLease()), this will be eventually noticed by the 
> others in the same oak-cluster. Those then mark this client as {{inactive}} 
> and start recovering and subsequently removing that node from any further 
> merge etc. operation.
> Now, whatever the reason was why that client stopped renewing the lease 
> (could be an exception, deadlock, whatever) - that client itself still 
> considers itself as {{active}} and continues to take part in the cluster 
> action.
> This will result in an unbalanced situation where that one client 'sees' 
> everybody as {{active}} while the others see this one as {{inactive}}.
> If this ClusterNodeInfo state should be something that can be built upon, and 
> to avoid any inconsistency due to unbalanced handling, the inactive node 
> should probably retire gracefully - or any other appropriate action should be 
> taken, other than just continuing as today.
> This ticket is to keep track of ideas and actions taken wrt this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
