[ 
https://issues.apache.org/jira/browse/OAK-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645620#comment-14645620
 ] 

Stefan Egli commented on OAK-2739:
----------------------------------

{quote}Currently background lease is updated periodically (every 1 sec) by a 
dedicated thread which just perform a single operation and not much. So even if 
there are issues in other parts this thread would continue to work (which might 
be wrong) and still update the lease every 1 sec.{quote}
That's not entirely correct: the lease is updated only every {{leaseTime / 2}}, 
by default every 30 sec. The thread _checks_ every 1 sec but only _updates_ 
the lease every 30 sec.
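
To illustrate the difference between checking and updating, here is a minimal 
sketch. It is not the actual {{ClusterNodeInfo}} code: the class and method 
names are hypothetical, and the 60 sec lease time is just what a default of 
{{leaseTime / 2}} = 30 sec implies.

```java
// Hypothetical sketch of "check every second, write every leaseTime / 2".
// Names are illustrative only and do not mirror Oak's real classes.
final class LeaseRenewalSketch {
    private final long leaseTimeMillis;
    private long leaseEndMillis;
    private int updates;

    LeaseRenewalSketch(long leaseTimeMillis, long nowMillis) {
        this.leaseTimeMillis = leaseTimeMillis;
        this.leaseEndMillis = nowMillis + leaseTimeMillis;
    }

    /**
     * Meant to be called by the background thread about once per second.
     * Returns true only when the lease was actually extended.
     */
    synchronized boolean renewIfDue(long nowMillis) {
        // Cheap check: nothing to do until half of the lease time is used up.
        if (nowMillis < leaseEndMillis - leaseTimeMillis / 2) {
            return false;
        }
        // In a real system this is where the store would be written to.
        leaseEndMillis = nowMillis + leaseTimeMillis;
        updates++;
        return true;
    }

    synchronized long leaseEndMillis() {
        return leaseEndMillis;
    }

    synchronized int updateCount() {
        return updates;
    }
}
```

With a lease acquired at t=0 and one call per second from t=1 sec to t=120 sec, 
this performs 120 checks but only 4 updates (at 30, 60, 90 and 120 sec).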

{quote}So to me lease update does not look like an operation which would take 
long time and cause above mentioned issues. May be I am missing something 
here{quote}
Generally speaking there are many reasons why one particular instance would not 
update a lease in time:
# because it crashed
# because it can't talk to mongo anymore
# because the process was halted and continued (e.g. open/close laptop, process 
started in fg - Ctrl-Z, kill -STOP, other terminal surprises)
# because the memory is very low, thus very long GC cycles, preventing much 
from happening in the VM
# because the {{BackgroundLeaseUpdate}} task for some reason died
# because the {{BackgroundLeaseUpdate}} task for some reason is halted or runs 
into a deadlock
# because of something I forgot or some other yet-to-be-discovered VM mystery

In any case, the other instances have no way of figuring out the exact reason 
and they can only assume that reason 1 (a crash) happened. If that's not the 
case, then this ticket is about finding a way to prevent the instance from 
continuing should it not be able to update the lease within e.g. 30 sec. I 
think it's fair to demand that an instance is always capable of updating the 
lease every 30 sec, and if it can't, then it shall remain silent once and for 
all. I'm not saying this situation is likely to occur very frequently, but if 
we're to build a reliable system then in my view this is a critical part of it.
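
A rough sketch of what "remain silent once and for all" could look like, under 
stated assumptions: the class {{LeaseGuardSketch}}, its methods and the 
30 sec limit are hypothetical and only illustrate the idea, they are not an 
existing Oak API. The guard remembers the last successful renewal, and every 
operation that touches the shared store would call {{checkLease}} first.

```java
// Hypothetical sketch: once a renewal is missed, the instance stays stopped.
// Names and the max gap are assumptions made for illustration.
final class LeaseGuardSketch {
    private final long maxRenewGapMillis;
    private long lastRenewalMillis;
    private boolean leaseFailed; // once true, it is never reset

    LeaseGuardSketch(long maxRenewGapMillis, long nowMillis) {
        this.maxRenewGapMillis = maxRenewGapMillis;
        this.lastRenewalMillis = nowMillis;
    }

    private boolean expired(long nowMillis) {
        return leaseFailed || nowMillis - lastRenewalMillis > maxRenewGapMillis;
    }

    /** Records a renewal; a renewal that comes too late is rejected. */
    synchronized boolean renew(long nowMillis) {
        if (expired(nowMillis)) {
            leaseFailed = true;
            return false;
        }
        lastRenewalMillis = nowMillis;
        return true;
    }

    /** Throws if the lease was not renewed in time, now or at any point before. */
    synchronized void checkLease(long nowMillis) {
        if (expired(nowMillis)) {
            leaseFailed = true;
            throw new IllegalStateException(
                    "lease not renewed in time, this instance must stop");
        }
    }
}
```

The point of the sticky {{leaseFailed}} flag is that a late renewal cannot 
bring the instance back: by then the other instances may already have marked 
it as {{inactive}} and started recovery.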

> take appropriate action when lease cannot be renewed (in time)
> --------------------------------------------------------------
>
>                 Key: OAK-2739
>                 URL: https://issues.apache.org/jira/browse/OAK-2739
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: mongomk
>    Affects Versions: 1.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>              Labels: resilience
>             Fix For: 1.3.5
>
>
> Currently, in an oak-cluster when (e.g.) one oak-client stops renewing its 
> lease (ClusterNodeInfo.renewLease()), this will be eventually noticed by the 
> others in the same oak-cluster. Those then mark this client as {{inactive}} 
> and start recovering and subsequently removing that node from any further 
> merge etc operation.
> Now, whatever the reason was why that client stopped renewing the lease 
> (could be an exception, deadlock, whatever) - that client itself still 
> considers itself as {{active}} and continues to take part in the cluster 
> action.
> This will result in an unbalanced situation where that one client 'sees' 
> everybody as {{active}} while the others see this one as {{inactive}}.
> If this ClusterNodeInfo state should be something that can be built upon, and 
> to avoid any inconsistency due to unbalanced handling, the inactive node 
> should probably retire gracefully - or any other appropriate action should be 
> taken, other than just continuing as today.
> This ticket is to keep track of ideas and actions taken wrt this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
