Stefan Egli created SLING-5965:
----------------------------------

             Summary: Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
                 Key: SLING-5965
                 URL: https://issues.apache.org/jira/browse/SLING-5965
             Project: Sling
          Issue Type: New Feature
          Components: Commons
    Affects Versions: Commons Scheduler 2.5.0
            Reporter: Stefan Egli
            Assignee: Stefan Egli
             Fix For: Commons Scheduler 2.5.2


Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast-running. 
They are served from a thread-pool and should each occupy a thread only for a 
short amount of time.

If 'misbehaving' Quartz-Jobs run for a very long time, they keep threads 
from that thread-pool occupied and can thereby degrade the performance of 
other scheduled Quartz-Jobs.

We should have metrics (using 
[sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
that provide information about the internals of Sling Scheduler, such as 
duration statistics of scheduled jobs (average, maximum, etc.), how many jobs 
are currently running, and how long the oldest running job has been running.
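
The bookkeeping behind such metrics could look roughly like the following. This is a minimal sketch in plain Java, not the Sling Commons Scheduler or sling.commons.metrics API: the class {{JobRunTracker}}, its method names, and the string job ids are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not part of Sling or Quartz.
final class JobRunTracker {
    // Start timestamp (ms) of each currently running job, keyed by a caller-chosen id.
    private final Map<String, Long> running = new ConcurrentHashMap<>();
    private final AtomicLong finishedCount = new AtomicLong();
    private final AtomicLong totalDurationMs = new AtomicLong();
    private final AtomicLong maxDurationMs = new AtomicLong();

    void jobStarted(String jobId, long nowMs) {
        running.put(jobId, nowMs);
    }

    void jobFinished(String jobId, long nowMs) {
        Long start = running.remove(jobId);
        if (start == null) {
            return; // job was never registered; nothing to record
        }
        long duration = nowMs - start;
        finishedCount.incrementAndGet();
        totalDurationMs.addAndGet(duration);
        maxDurationMs.accumulateAndGet(duration, Math::max);
    }

    // Number of jobs that have started but not yet finished.
    int runningCount() {
        return running.size();
    }

    // Longest duration among finished jobs, 0 if none finished yet.
    long maxDurationMs() {
        return maxDurationMs.get();
    }

    // Mean duration of finished jobs, 0.0 if none finished yet.
    // Count and total are read separately, so the value is approximate under contention.
    double averageDurationMs() {
        long n = finishedCount.get();
        return n == 0 ? 0.0 : (double) totalDurationMs.get() / n;
    }

    // Age of the oldest still-running job, 0 if no job is running.
    long oldestRunningAgeMs(long nowMs) {
        long oldestStart = nowMs;
        for (long start : running.values()) {
            oldestStart = Math.min(oldestStart, start);
        }
        return nowMs - oldestStart;
    }
}
```

In a real implementation these values would presumably be exposed as gauges and timers through the metrics bundle rather than through plain getters; how that wiring would look is left open here.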

Based on this, a Health-Check can monitor the 'oldest job running' metric and 
flag {{critical}} when, e.g., the oldest job is older than {{60'000ms}} 
(the threshold is configurable; this value is the default).
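
The threshold rule described above can be sketched as follows. This is an illustration in plain Java under stated assumptions, not the Sling Health Check API: the class {{OldestJobCheck}}, the {{Status}} enum, and the method {{evaluate}} are hypothetical names, and "older than" is read as strictly greater than the threshold.

```java
// Hypothetical sketch for illustration only; not the Sling Health Check API.
final class OldestJobCheck {
    enum Status { OK, CRITICAL }

    // Default threshold taken from the issue description: 60'000 ms.
    static final long DEFAULT_THRESHOLD_MS = 60_000L;

    private final long thresholdMs;

    OldestJobCheck() {
        this(DEFAULT_THRESHOLD_MS);
    }

    OldestJobCheck(long thresholdMs) {
        if (thresholdMs <= 0) {
            throw new IllegalArgumentException("thresholdMs must be positive");
        }
        this.thresholdMs = thresholdMs;
    }

    // CRITICAL when the oldest running job is older than the threshold, OK otherwise.
    Status evaluate(long oldestRunningAgeMs) {
        return oldestRunningAgeMs > thresholdMs ? Status.CRITICAL : Status.OK;
    }
}
```

For example, {{new OldestJobCheck().evaluate(75_000L)}} yields {{CRITICAL}}, while a check constructed with a threshold of {{120_000L}} reports {{OK}} for the same age.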



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
