[ 
https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125442#comment-16125442
 ] 

Stefan Egli commented on SLING-5965:
------------------------------------

The Metrics and HC implementations were written to use as much of commons 
metrics as possible. However, commons metrics has no MetricRegistry 
equivalent, and without one the HC could not have been written, which is why 
{{SchedulerHealthCheck}} currently depends on the dropwizard API. (Funnily 
enough, if we had split the HC into its own bundle, I guess things would have 
looked perfectly fine.)

The question with switching directly to dropwizard for collecting the metrics 
(as is done in the patch) is whether we would lose functionality that 
commons metrics provides (some of which is mentioned [in 
SLING-7043|https://issues.apache.org/jira/browse/SLING-7043?focusedCommentId=16125306&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16125306]).
I was under the assumption that it is a good thing to _collect_ the metrics 
via commons metrics, but necessary to _evaluate_ them via dropwizard (as that 
was the only possible way). But I also see that the resulting dual dependency 
can be seen as a negative.

So I guess there are two solutions:
a) apply the proposed patch, which switches to using only the dropwizard API
b) add support for a MetricRegistry equivalent to commons metrics, which the 
SchedulerHealthCheck could then use.

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.jpg, oldestRunningJob.jpg, patch.txt, 
> SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt, 
> SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt, SLING-5965.v5.patch.txt, 
> timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs. 
> They are served from a thread-pool and should occupy that thread only for a 
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they 
> occupy threads from that thread-pool and thus affect the performance of 
> other scheduled/quartz-jobs.
> We should have metrics (using 
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
> that provide information about the internals of Sling Scheduler, such as the 
> average, max, etc. duration of scheduled jobs, how many jobs are currently 
> running, and how long the oldest running job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and 
> flag {{critical}} when, e.g., the oldest job is older than {{60'000ms}} 
> (the default, configurable).
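
For illustration only, the check described in the issue could be sketched roughly as below. This is plain Java with no dependency on commons metrics or dropwizard; the class {{OldestJobCheck}} and all of its method names are hypothetical and are not taken from the actual {{SchedulerHealthCheck}} or from any of the attached patches. It only shows the idea of tracking the start time of running jobs and flagging critical once the oldest one exceeds a configurable threshold.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not the real SchedulerHealthCheck. */
final class OldestJobCheck {

    public enum Status { OK, CRITICAL }

    // Job name -> start timestamp (ms) of each currently running job.
    private final Map<String, Long> runningJobs = new ConcurrentHashMap<>();

    // Threshold above which the oldest running job is considered critical.
    private final long maxAgeMillis;

    OldestJobCheck(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    void jobStarted(String name, long nowMillis) {
        runningJobs.put(name, nowMillis);
    }

    void jobFinished(String name) {
        runningJobs.remove(name);
    }

    /** Number of jobs currently running. */
    int numRunningJobs() {
        return runningJobs.size();
    }

    /** Age (ms) of the oldest running job, or empty if no job is running. */
    OptionalLong oldestRunningJobAge(long nowMillis) {
        return runningJobs.values().stream()
                .mapToLong(start -> nowMillis - start)
                .max();
    }

    /** CRITICAL only if some job has been running longer than the threshold. */
    Status evaluate(long nowMillis) {
        OptionalLong age = oldestRunningJobAge(nowMillis);
        return age.isPresent() && age.getAsLong() > maxAgeMillis
                ? Status.CRITICAL
                : Status.OK;
    }
}
```

With a threshold of 60000 ms, a job that started at time 0 would leave the check at OK when evaluated at 60000 and turn it CRITICAL at 60001. In the real implementation the running-job data would presumably come from the collected metrics rather than from a map held by the check itself.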



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
