[jira] [Commented] (SLING-5965) Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs

Chetan Mehrotra (JIRA) Tue, 16 Aug 2016 03:23:02 -0700

    [ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422553#comment-15422553 ]


Chetan Mehrotra commented on SLING-5965:
----------------------------------------

Looks useful! A couple of points:
{noformat}
+        final Counter runningJobsCounter = metricsService == null ? null : metricsService.counter(QuartzScheduler.METRICS_NAME_RUNNING_QUARTZJOBS);
+        final Timer jobDurationTimer = metricsService == null ? null : metricsService.timer(QuartzScheduler.METRICS_NAME_QUARTZJOBS_DURATION);
{noformat}
Instead of all those null checks, you can just fall back to 
{{MetricsService#NOOP}}. This would make the code cleaner.
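
To illustrate the fallback idea, here is a minimal, self-contained sketch. The {{Counter}} and {{MetricsService}} types below are simplified stand-ins written for this example, not the real Commons Metrics interfaces, and the {{SchedulerMetrics}} class and the metric name are hypothetical. The point is only that resolving the service once to a no-op instance removes the per-metric null checks.

```java
// Simplified stand-ins for illustration only; the real interfaces live in
// org.apache.sling.commons.metrics and offer more than is shown here.
interface Counter {
    void increment();
    void decrement();
    long getCount();
}

interface MetricsService {
    Counter counter(String name);

    // No-op implementation: hands out a counter that ignores all updates.
    MetricsService NOOP = new MetricsService() {
        private final Counter noopCounter = new Counter() {
            @Override public void increment() { }
            @Override public void decrement() { }
            @Override public long getCount() { return 0L; }
        };

        @Override public Counter counter(String name) {
            return noopCounter;
        }
    };
}

// Hypothetical holder class: resolves the service once, then never checks for null again.
final class SchedulerMetrics {
    private final Counter runningJobs;

    SchedulerMetrics(MetricsService metricsService) {
        // Single fallback point instead of a null check per metric.
        MetricsService effective = metricsService != null ? metricsService : MetricsService.NOOP;
        // The metric name is made up for this sketch.
        this.runningJobs = effective.counter("example.scheduler.running.jobs");
    }

    void jobStarted() { runningJobs.increment(); }

    void jobFinished() { runningJobs.decrement(); }

    long running() { return runningJobs.getCount(); }
}
```

With this shape, {{new SchedulerMetrics(null)}} is safe to use: the calls simply do nothing when no metrics service is bound.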

* For collecting job runtime, it would be better to make use of a 
[JobListener|http://www.quartz-scheduler.org/documentation/quartz-2.1.x/cookbook/JobListeners.html],
 where you can get the execution time of any fired job via 
{{JobExecutionContext#getJobRunTime}}
* We can look into exposing 
[QuartzSchedulerMBean|http://www.quartz-scheduler.org/api/2.1.7/org/quartz/core/jmx/QuartzSchedulerMBean.html].
 Some methods would probably need to be disabled, such as those for adding jobs 
(though leaving them enabled might be fine too)
* A direct dependency on MetricRegistry should be avoided. If gauge support is 
required, we can add an abstraction for that in Commons Metrics
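
As a rough sketch of the listener-based approach from the first point: the code below aggregates run times reported after each execution. {{RunTimeSource}} is a stand-in for the single {{JobExecutionContext#getJobRunTime}} accessor so that the example compiles without Quartz, and {{JobDurationRecorder}} is a hypothetical name. In a real bundle the same logic would presumably sit in the {{jobWasExecuted}} callback of a Quartz {{JobListener}} and feed a timer from Commons Metrics rather than plain counters.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for the one accessor of org.quartz.JobExecutionContext used here.
interface RunTimeSource {
    long getJobRunTime(); // run time of the finished job in milliseconds
}

// Hypothetical aggregator; thread-safe because listeners may be called concurrently.
final class JobDurationRecorder {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalMillis = new AtomicLong();
    private final AtomicLong maxMillis = new AtomicLong();

    // Mirrors the "job was executed" callback of a job listener.
    void jobWasExecuted(RunTimeSource context) {
        long runTime = context.getJobRunTime();
        count.incrementAndGet();
        totalMillis.addAndGet(runTime);
        maxMillis.accumulateAndGet(runTime, Math::max);
    }

    long count() { return count.get(); }

    long maxMillis() { return maxMillis.get(); }

    // Average run time in milliseconds, or 0 if nothing has been recorded yet.
    double averageMillis() {
        long n = count.get();
        return n == 0 ? 0.0 : (double) totalMillis.get() / n;
    }
}
```

The advantage over wrapping each job manually is that the listener sees every fired job, so no scheduling code path can bypass the measurement.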

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.5.2
>
>         Attachments: SLING-5965.patch
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs. 
> They are served from a thread-pool and should occupy that thread only for a 
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they 
> start to occupy threads from that thread-pool, thus have an influence on the 
> performance of other scheduled/quartz-jobs.
> We should have metrics (using 
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
>  that provide information about the internals of Sling Scheduler, such as the 
> average and maximum duration of scheduled jobs, how many jobs are currently 
> running, and since when the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and 
> flag {{critical}} when, e.g., the oldest job has been running for longer than 
> {{60'000ms}} (the default, configurable).
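
The 'oldest job running' check described in the issue could look roughly like the sketch below. It is not taken from the attached patch: the {{OldestJobCheck}} class, its {{Status}} enum, and the use of string job ids are assumptions made for this example. Only the 60'000ms default threshold comes from the description above. The current time is passed in explicitly so that the logic stays easy to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: tracks start times of running jobs and flags long runners.
final class OldestJobCheck {
    enum Status { OK, CRITICAL }

    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
    private final long maxAgeMillis;

    OldestJobCheck(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    void jobStarted(String jobId, long nowMillis) {
        startTimes.put(jobId, nowMillis);
    }

    void jobFinished(String jobId) {
        startTimes.remove(jobId);
    }

    // Age of the oldest job still running, or 0 if no job is running.
    long oldestAgeMillis(long nowMillis) {
        long oldestStart = Long.MAX_VALUE;
        for (long start : startTimes.values()) {
            oldestStart = Math.min(oldestStart, start);
        }
        return oldestStart == Long.MAX_VALUE ? 0L : nowMillis - oldestStart;
    }

    // CRITICAL only when the oldest job is strictly older than the threshold.
    Status check(long nowMillis) {
        return oldestAgeMillis(nowMillis) > maxAgeMillis ? Status.CRITICAL : Status.OK;
    }
}
```

Once the long-running job finishes, the check recovers on its own, because only jobs that are still running contribute to the age.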



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
