[
https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073834#comment-16073834
]
Stefan Egli edited comment on SLING-5965 at 7/4/17 3:41 PM:
------------------------------------------------------------
Attached [^SLING-5965.v3.patch.txt]
h4. metrics
* the following metrics exist:
** number of currently running jobs
** age of the oldest currently running job - if a job has been running longer
than a threshold (1000ms by default), a temporary gauge is created for just
that slow job, indicating the name of the slow job
** timers over all jobs
* each of the above is reported
** grouped by thread pool name
** grouped by a configurable filter (for example, to separate certain known
slow or frequent jobs)
** grouped by slow jobs (auto-detected and auto-created when hit)
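As a rough illustration of the first two metrics listed above, the bookkeeping could look like the following. This is a minimal sketch and not the code from the patch: the class {{RunningJobsSketch}} and all of its method names are hypothetical, and time is passed in explicitly only to keep the example deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are taken from the patch.
final class RunningJobsSketch {

    // Default slow-job threshold mentioned in the comment (1000ms).
    static final long DEFAULT_SLOW_THRESHOLD_MILLIS = 1_000L;

    // Job name -> start time in milliseconds, for jobs that are still running.
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();

    void jobStarted(String jobName, long nowMillis) {
        startTimes.put(jobName, nowMillis);
    }

    void jobFinished(String jobName) {
        startTimes.remove(jobName);
    }

    // Metric 1: number of currently running jobs.
    int runningCount() {
        return startTimes.size();
    }

    // Metric 2: age of the oldest currently running job (0 if none is running).
    long oldestAgeMillis(long nowMillis) {
        long oldest = 0L;
        for (long start : startTimes.values()) {
            oldest = Math.max(oldest, nowMillis - start);
        }
        return oldest;
    }

    // Jobs running longer than the threshold, keyed by job name. Each entry
    // stands in for one of the temporary per-job "slow" gauges.
    Map<String, Long> slowJobs(long nowMillis, long thresholdMillis) {
        Map<String, Long> slow = new TreeMap<>();
        for (Map.Entry<String, Long> e : startTimes.entrySet()) {
            long age = nowMillis - e.getValue();
            if (age > thresholdMillis) {
                slow.put(e.getKey(), age);
            }
        }
        return slow;
    }
}
```

With two jobs started at 0ms and 900ms and the clock at 1500ms, only the first job exceeds the 1000ms default, so only that one would get a temporary slow gauge.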
h4. number of running jobs metrics example
!numRunningJobs.jpg|width=640!
h4. oldest running job metrics example
!oldestRunningJob.jpg|width=640!
h4. timers metrics example
!timers.jpg|width=640!
h4. Scheduler Health Check
There's a scheduler health check which does the following:
* if there are 0 running jobs, it's all green
* if there are 1 or more running jobs, it checks how old the oldest one is
* if the oldest is older than the configured threshold (60000ms by default),
the health check becomes red and tries to extract more information about which
job is slow. It does that by listing all
{{sling:commons.scheduler.oldest.running.job.millis.slow.}} gauges and showing
for each how old it is (these {{slow}} gauges are auto-created when accessing
any of the other {{sling:commons.scheduler.oldest.running.job.millis.}} gauges).
!SchedulerHealthCheck.jpg|width=640!
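The decision logic of the health check described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the patch: the class {{SchedulerHealthSketch}}, the {{evaluate}} method, and the plain-string result are hypothetical, and the gauges are modelled as a simple name-to-age map rather than through the metrics API. Only the gauge-name prefix and the 60000ms default come from the comment.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; class and method names are not taken from the patch.
final class SchedulerHealthSketch {

    // Prefix of the auto-created slow-job gauges, as quoted in the comment.
    static final String SLOW_PREFIX =
            "sling:commons.scheduler.oldest.running.job.millis.slow.";

    // Default maximum acceptable age of the oldest running job (60000ms).
    static final long DEFAULT_MAX_AGE_MILLIS = 60_000L;

    // Returns "GREEN", or a "RED: ..." message listing every slow gauge.
    static String evaluate(int runningJobs, long oldestAgeMillis,
                           long maxAgeMillis, Map<String, Long> gauges) {
        // No running jobs: nothing can be slow.
        if (runningJobs == 0) {
            return "GREEN";
        }
        // Jobs are running, but the oldest is not older than the threshold.
        if (oldestAgeMillis <= maxAgeMillis) {
            return "GREEN";
        }
        // Too old: report the age of each slow gauge, sorted by gauge name.
        StringBuilder sb = new StringBuilder("RED: oldest job running for ")
                .append(oldestAgeMillis).append("ms");
        for (Map.Entry<String, Long> e : new TreeMap<>(gauges).entrySet()) {
            if (e.getKey().startsWith(SLOW_PREFIX)) {
                sb.append("; ")
                  .append(e.getKey().substring(SLOW_PREFIX.length()))
                  .append('=').append(e.getValue()).append("ms");
            }
        }
        return sb.toString();
    }
}
```

Gauges whose names do not start with the slow prefix are ignored, which mirrors the idea that only the auto-created {{slow}} gauges identify the offending job.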
reviews very welcome, /cc [~chetanm], [~cziegeler]
was (Author: egli):
Attached [^SLING-5965.v3.patch.txt]
h4. metrics
* the following metrics exist:
** number of currently running jobs
** oldest currently running job - if one is above a threshold (1000ms by
default) and it creates a temporary gauge for just that slow one, indicating
the name of the slow job
** timers over all jobs
* all of the above is done
** grouped by thread pool name
** grouped by a configurable filter (to separate certain known slow or frequent
jobs for example)
** grouped by slow jobs (auto-detected and auto-created when hit)
h4. number of running jobs metrics example
!numRunningJobs.tiff|thumbnail!
h4. oldest running job metrics example
!oldestRunningJob.tiff|thumbnail!
h4. timers metrics example
!timers.tiff|thumbnail!
h4. Scheduler Health Check
There's a scheduler health check which does the following:
* if there are 0 running jobs it's all green
* if there are 1 or more running jobs it checks how old the oldest one is
* if the oldest is older than what's configured (60000ms by default) then this
health-check becomes red and it tries to extract more infos as to which job is
slow. it does that by listing all
{{sling:commons.scheduler.oldest.running.job.millis.slow.}} gauges and shows
for each how old it is (these {{slow}} gauges are auto-created when accessing
any of the other {{sling:commons.scheduler.oldest.running.job.millis.}} gauges).
!SchedulerHealthCheck.tiff|thumbnail!
reviews very welcome, /cc [~chetanm], [~cziegeler]
> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
> Key: SLING-5965
> URL: https://issues.apache.org/jira/browse/SLING-5965
> Project: Sling
> Issue Type: New Feature
> Components: Commons
> Affects Versions: Commons Scheduler 2.5.0
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Fix For: Commons Scheduler 2.6.4
>
> Attachments: numRunningJobs.jpg, oldestRunningJob.jpg,
> SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt,
> SLING-5965.v3.patch.txt, timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs.
> They are served from a thread-pool and should occupy that thread only for a
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they
> start to occupy threads from that thread-pool and thus affect the
> performance of other scheduled/quartz-jobs.
> We should have metrics (using
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
> that provide information about the internals of Sling Scheduler, such as the
> average and max duration of scheduled jobs, as well as how many jobs are
> currently running and how long the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and
> flag {{critical}} when e.g. the oldest job is older than {{60'000ms}}
> (configurable, default).