[ 
https://issues.apache.org/jira/browse/OAK-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745194#comment-14745194
 ] 

Stefan Egli commented on OAK-3238:
----------------------------------

Note that with OAK-3398 and OAK-3399 incorporated in 1.3.7, the timing will be
as follows:
* the lease is checked every 1sec (as usual)
* the lease is updated every 10sec
* the lease is valid for 120sec
* when the lease is valid for only another 20sec, a 5sec retry loop ('one final
chance') starts; after this, i.e. 15sec before the lease ends, the local
instance declares 'lease failure' and takes appropriate steps (default: stop
the oak-core bundle)

Taking a clock accuracy of 4-10sec into account, the last possible write
operation from a faulty instance would happen no later than 5sec before the
lease timeout, from the point of view of all other instances.
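
The arithmetic behind these numbers can be sketched as follows. This is only
an illustration of the values quoted above, not code from oak-core; the class,
constant and method names are hypothetical.

```java
public class LeaseTimingSketch {
    // Values quoted in the comment above, in seconds. Names are hypothetical.
    static final int FINAL_CHANCE_START_SEC = 20; // remaining validity when the retry loop starts
    static final int FINAL_CHANCE_RETRY_SEC = 5;  // length of the 'one final chance' retry loop

    /** Remaining lease validity when the local instance declares 'lease failure'. */
    static int failureMarginSec() {
        return FINAL_CHANCE_START_SEC - FINAL_CHANCE_RETRY_SEC;
    }

    /**
     * Time between the last possible local write and the lease timeout as
     * seen by another instance whose clock differs by clockDifferenceSec.
     */
    static int remoteSafetyMarginSec(int clockDifferenceSec) {
        return failureMarginSec() - clockDifferenceSec;
    }

    public static void main(String[] args) {
        System.out.println("failure declared " + failureMarginSec() + "sec before lease end");
        System.out.println("margin at 4sec clock difference: " + remoteSafetyMarginSec(4) + "sec");
        System.out.println("margin at 10sec clock difference: " + remoteSafetyMarginSec(10) + "sec");
    }
}
```

With the worst-case clock difference of 10sec assumed above, the margin is
15 - 10 = 5sec, which matches the figure given in the comment.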

> fine tune clock-sync check vs lease-check settings
> --------------------------------------------------
>
>                 Key: OAK-3238
>                 URL: https://issues.apache.org/jira/browse/OAK-3238
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3.4
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: 1.3.5
>
>
> There are now two components that try to assure 'discovery-lite' (OAK-2844) 
> is reporting a coherent cluster view to the upper layers:
> * OAK-2682 : time difference detection: by default fails if clock is off by 
> more than 2 seconds at startup. That results in a 4 sec max margin in a 
> document-cluster
> * OAK-2739 : lease-checking: every instance checks if the local lease is 
> valid upon any document access. This check is done against the actual 
> 'leaseEndTime' - which is updated every (by default) 30 seconds to be valid 
> for (by default) another 60 seconds.
> With these two factors combined, in the worst case you could still end up
> with a 4 second time window in which the local instance has failed to update
> the lease (e.g. the lease thread died) but still considers itself the owner
> of a valid lease, while a remote instance might be those 4 seconds off and
> consider the lease timed out.
> So overall: the 3 factors 'lease duration', 'lease update frequency' and 
> 'maximum allowed clock difference' must be better tuned to end up in a stable 
> mechanism.
> Suggestion:
> * increase the 'lease duration' to 3 x 'lease update frequency', i.e. a
> 90sec lease duration
> * reduce the lease check failure limit from 'lease duration' to 2 x 'lease
> update frequency', assuming that one 'lease update interval' is much larger
> than the 'maximum allowed clock difference'
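
The suggestion in the issue description can be checked with a small sketch.
It is one reading of the proposal, not Oak's implementation: all names are
hypothetical, and the 30sec update interval and 4sec maximum clock difference
are the defaults quoted in the description.

```java
public class LeaseTuningSketch {
    /** Proposed lease duration: 3 x the lease update interval. */
    static int leaseDurationSec(int updateIntervalSec) {
        return 3 * updateIntervalSec;
    }

    /** Proposed local failure limit: 2 x the lease update interval. */
    static int failureLimitSec(int updateIntervalSec) {
        return 2 * updateIntervalSec;
    }

    /**
     * The slack between the local failure limit and the lease duration is one
     * update interval; it has to exceed the maximum allowed clock difference.
     */
    static boolean slackCoversClockDifference(int updateIntervalSec, int maxClockDifferenceSec) {
        int slackSec = leaseDurationSec(updateIntervalSec) - failureLimitSec(updateIntervalSec);
        return slackSec > maxClockDifferenceSec;
    }

    public static void main(String[] args) {
        int updateIntervalSec = 30;    // default quoted in the description
        int maxClockDifferenceSec = 4; // 2 x the 2sec startup tolerance
        System.out.println("lease duration: " + leaseDurationSec(updateIntervalSec) + "sec");
        System.out.println("failure limit: " + failureLimitSec(updateIntervalSec) + "sec");
        System.out.println("slack covers clock difference: "
                + slackCoversClockDifference(updateIntervalSec, maxClockDifferenceSec));
    }
}
```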



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
