Stefan Egli created SLING-5285:
----------------------------------

             Summary: more aggressive self-check for heartbeat timeout
                 Key: SLING-5285
                 URL: https://issues.apache.org/jira/browse/SLING-5285
             Project: Sling
          Issue Type: Improvement
          Components: Extensions
    Affects Versions: Discovery Impl 1.2.0
            Reporter: Stefan Egli
            Assignee: Stefan Egli
             Fix For: Discovery Impl 1.2.2


SLING-5195 introduced a self-check that monitors whether the HeartbeatHandler
stores its heartbeats regularly. This is needed because there are several
reasons why it might not, e.g. the HeartbeatHandler could be blocked by another
long-running commit happening locally, by thread-pool exhaustion, or by
something else entirely.

The check raised an alarm when the time since the last heartbeat exceeded
*heartbeatTimeout*. This, however, is not sufficient: the comparison should be
much more aggressive. It should compare against *heartbeatTimeout minus 2 times
heartbeatInterval* to leave enough safety margin. _2 times_ because 1 time is
the bare minimum: this background check only _runs_ every heartbeatInterval, so
in the worst case it could run just one _heartbeatInterval_ (in seconds) before
the timeout hits, not yet trigger, and then be too late by a fraction on its
next run. So 1 is the bare minimum, and the _2_ adds a safety margin of only
1 _heartbeatInterval_ on top of that.
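
The proposed comparison can be sketched as follows. This is only an
illustration of the arithmetic, not the actual Discovery Impl code: the class
and method names are hypothetical, and all values are assumed to be in seconds.

```java
/** Hypothetical helper; names are illustrative, not part of Discovery Impl. */
final class HeartbeatSelfCheck {

    private HeartbeatSelfCheck() {
    }

    /** Old behaviour as described above: alarm only once the full timeout has passed. */
    static boolean isOverdueOld(long secondsSinceLastHeartbeat, long heartbeatTimeout) {
        return secondsSinceLastHeartbeat > heartbeatTimeout;
    }

    /** Proposed threshold: heartbeatTimeout minus 2 times heartbeatInterval. */
    static long alarmThreshold(long heartbeatTimeout, long heartbeatInterval) {
        return heartbeatTimeout - 2 * heartbeatInterval;
    }

    /** Proposed behaviour: alarm as soon as the more aggressive threshold is exceeded. */
    static boolean isOverdue(long secondsSinceLastHeartbeat,
                             long heartbeatTimeout,
                             long heartbeatInterval) {
        return secondsSinceLastHeartbeat
                > alarmThreshold(heartbeatTimeout, heartbeatInterval);
    }
}
```

With assumed example values of heartbeatTimeout = 30 and heartbeatInterval = 5,
the threshold is 20: a last heartbeat 21 seconds ago already raises the alarm
under the proposed check, while the old check would stay silent until more than
30 seconds had passed.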

*Note:* this also means that heartbeatTimeout should be configured to at least
4-5 times the heartbeatInterval.
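
To see why the ratio matters: with heartbeatTimeout equal to only 2 times
heartbeatInterval, the threshold (timeout minus 2 intervals) would be zero and
the alarm would fire on every run. A configuration sanity check along these
lines could catch that. It is a sketch only: the class, the method, and the
choice of 4 as the lower bound are assumptions taken from the "4-5 times"
guidance above, not an existing Discovery Impl setting.

```java
/** Hypothetical configuration check; not part of Discovery Impl. */
final class HeartbeatConfigCheck {

    /** Assumed lower bound, taken from the "at least 4-5 times" guidance. */
    static final long MIN_TIMEOUT_TO_INTERVAL_RATIO = 4;

    private HeartbeatConfigCheck() {
    }

    /**
     * Returns true if heartbeatTimeout is at least
     * MIN_TIMEOUT_TO_INTERVAL_RATIO times heartbeatInterval.
     * Both values must use the same unit (e.g. seconds).
     */
    static boolean meetsRecommendedRatio(long heartbeatTimeout, long heartbeatInterval) {
        if (heartbeatInterval <= 0) {
            throw new IllegalArgumentException("heartbeatInterval must be positive");
        }
        return heartbeatTimeout >= MIN_TIMEOUT_TO_INTERVAL_RATIO * heartbeatInterval;
    }
}
```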



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
