Stefan Egli created SLING-5285:
----------------------------------
Summary: more aggressive self-check for heartbeat timeout
Key: SLING-5285
URL: https://issues.apache.org/jira/browse/SLING-5285
Project: Sling
Issue Type: Improvement
Components: Extensions
Affects Versions: Discovery Impl 1.2.0
Reporter: Stefan Egli
Assignee: Stefan Egli
Fix For: Discovery Impl 1.2.2
SLING-5195 introduced a self-check that monitors whether the HeartbeatHandler
is storing heartbeats regularly. This is needed because there are several
reasons why it might not be, e.g. the HeartbeatHandler could be blocked by
another long-running commit happening locally, or by thread-pool exhaustion,
or by something else entirely.
The check raised an alarm when the time since the last heartbeat was greater
than *heartbeatTimeout*. This is not sufficient, however: the comparison
should be much more aggressive. It should compare against
*heartbeatTimeout minus 2 times heartbeatInterval* to leave enough safety
margin. Subtracting 1 time is the very minimum: this background check only
_runs_ every _heartbeatInterval_, so in the worst case it could run just under
_heartbeatInterval_ seconds before the timeout hits, and its next run would
then be too late by a fraction. Subtracting _2 times_ therefore adds a safety
margin of only 1 _heartbeatInterval_.
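The proposed comparison could be sketched roughly as follows. This is only an illustration and not the actual HeartbeatHandler code: the class name, the method names and the example values are assumptions, and both settings are taken to be in seconds as described above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the more aggressive self-check; not the real implementation. */
final class HeartbeatSelfCheck {
    private final long heartbeatTimeoutMillis;
    private final long heartbeatIntervalMillis;

    HeartbeatSelfCheck(long heartbeatTimeoutSeconds, long heartbeatIntervalSeconds) {
        this.heartbeatTimeoutMillis = TimeUnit.SECONDS.toMillis(heartbeatTimeoutSeconds);
        this.heartbeatIntervalMillis = TimeUnit.SECONDS.toMillis(heartbeatIntervalSeconds);
    }

    /** heartbeatTimeout minus 2 times heartbeatInterval, in milliseconds. */
    long alarmThresholdMillis() {
        return heartbeatTimeoutMillis - 2 * heartbeatIntervalMillis;
    }

    /** True if the time since the last stored heartbeat exceeds the reduced threshold. */
    boolean isHeartbeatOverdue(long lastHeartbeatMillis, long nowMillis) {
        return nowMillis - lastHeartbeatMillis > alarmThresholdMillis();
    }
}
```

With illustrative values of heartbeatTimeout=120 and heartbeatInterval=30, the alarm would be raised once more than 60 seconds have passed since the last stored heartbeat, instead of only after 120 seconds.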
*Note:* this also means that you should configure heartbeatTimeout to be at
least 4-5 times heartbeatInterval.
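One way to read this note: since 2 intervals are subtracted from the timeout, a timeout of only 2 intervals would leave a threshold of zero and the alarm would always fire. A configuration sanity check along these lines might look like the sketch below. The class, the method names and the choice of 4 as the lower bound are assumptions for illustration only, not an existing Sling API.

```java
/** Hypothetical configuration sanity check; names and factor are illustrative assumptions. */
final class HeartbeatConfigCheck {
    /** Lower bound of the "4-5 times" recommendation in the note above. */
    static final long MIN_FACTOR = 4;

    /** True if heartbeatTimeout is at least MIN_FACTOR times heartbeatInterval. */
    static boolean isTimeoutLargeEnough(long heartbeatTimeoutSeconds, long heartbeatIntervalSeconds) {
        if (heartbeatIntervalSeconds <= 0) {
            throw new IllegalArgumentException("heartbeatInterval must be positive");
        }
        return heartbeatTimeoutSeconds >= MIN_FACTOR * heartbeatIntervalSeconds;
    }

    /** Time that remains before the alarm fires: heartbeatTimeout minus 2 times heartbeatInterval. */
    static long alarmThresholdSeconds(long heartbeatTimeoutSeconds, long heartbeatIntervalSeconds) {
        return heartbeatTimeoutSeconds - 2 * heartbeatIntervalSeconds;
    }
}
```

For example, heartbeatTimeout=60 with heartbeatInterval=30 would fail this check and leave a threshold of 0 seconds, whereas heartbeatTimeout=120 with the same interval would pass and leave 60 seconds.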
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)