[jira] [Commented] (SLING-5965) Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs

Chetan Mehrotra (JIRA) Wed, 05 Jul 2017 22:59:39 -0700

    [ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075983#comment-16075983 ]


Chetan Mehrotra commented on SLING-5965:
----------------------------------------

The output looks useful. Some feedback on the implementation below.

*Quartz Scheduler*

# Would this work reliably? The previously started scheduler would still be 
running and may need to be shut down first, as is done in deactivate
{noformat}
+    @Modified
+    protected void modified(final BundleContext ctx, final QuartzSchedulerConfiguration configuration) {
+        // support modifying configuration without bundle restart
+        activate(ctx, configuration);
+    }
{noformat}
# createTemporaryGauge - It is possible that the gauge is never read, or is 
read only after a long time. This would cause the jobExecutionContext to 
remain held in the gauge. Maybe have a way to prune such expired gauges 
explicitly rather than relying on Gauge invocation
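
To illustrate the second point, here is a minimal sketch of explicit pruning. It is not taken from the patch: the class TemporaryGauges, its methods, and the fixed maximum age are all hypothetical, and it uses only the JDK rather than the sling.commons.metrics API. The idea is that a periodic task calls prune(), so an entry (and whatever it references) is released even if nobody ever reads the gauge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper, not part of the patch: keeps temporary gauges together
// with their creation time so they can be dropped without ever being read.
final class TemporaryGauges {

    private static final class Entry {
        final Supplier<Long> value;
        final long createdAtMillis;

        Entry(Supplier<Long> value, long createdAtMillis) {
            this.value = value;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Entry> gauges = new ConcurrentHashMap<>();
    private final long maxAgeMillis; // assumed policy: fixed maximum age

    TemporaryGauges(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    void register(String name, Supplier<Long> value, long nowMillis) {
        gauges.put(name, new Entry(value, nowMillis));
    }

    // Returns the current gauge value, or null if the gauge is unknown or pruned.
    Long read(String name) {
        Entry e = gauges.get(name);
        return e == null ? null : e.value.get();
    }

    // Removes every gauge that has reached maxAgeMillis; returns how many were removed.
    // Intended to be called periodically, independent of any gauge read.
    int prune(long nowMillis) {
        int removed = 0;
        for (Map.Entry<String, Entry> e : gauges.entrySet()) {
            boolean expired = nowMillis - e.getValue().createdAtMillis >= maxAgeMillis;
            if (expired && gauges.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return gauges.size();
    }
}
```

Passing the current time in as a parameter keeps the sketch easy to test; a real implementation would also have to unregister the gauge from the metrics registry when pruning it.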

Also, it may be better to move all this logic into a separate class 
(QuartzScheduler is already big!) such as JobStatsCollector and have it called 
from within QuartzJobExecutor and QuartzScheduler. That class would hold all 
the metrics-related logic, and the core logic would just make callbacks to it. 
This would also simplify adding some basic tests for this feature
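
A rough sketch of what such a collector could look like, under stated assumptions: only the class name JobStatsCollector comes from the suggestion above, while the callback names (jobStarted, jobFinished), the String job id, and the explicit time parameter are made up for illustration. It deliberately has no dependency on Quartz or sling.commons.metrics, which is what would make it easy to unit test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shape of the collector; method names and signatures are assumptions.
final class JobStatsCollector {

    // Start time per running job, keyed by an assumed unique job id.
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
    private final AtomicLong finishedCount = new AtomicLong();
    private final AtomicLong totalDurationMillis = new AtomicLong();

    // Callback the job executor would make just before running a job.
    void jobStarted(String jobId, long nowMillis) {
        startTimes.put(jobId, nowMillis);
    }

    // Callback the job executor would make after a job ends (normally or not).
    void jobFinished(String jobId, long nowMillis) {
        Long start = startTimes.remove(jobId);
        if (start != null) {
            finishedCount.incrementAndGet();
            totalDurationMillis.addAndGet(nowMillis - start);
        }
    }

    int runningJobs() {
        return startTimes.size();
    }

    // Age of the oldest job still running, or 0 if nothing is running.
    long oldestRunningJobAgeMillis(long nowMillis) {
        long oldestStart = Long.MAX_VALUE;
        for (long start : startTimes.values()) {
            oldestStart = Math.min(oldestStart, start);
        }
        return oldestStart == Long.MAX_VALUE ? 0L : nowMillis - oldestStart;
    }

    // Average duration of finished jobs, or 0 if none has finished yet.
    long averageDurationMillis() {
        long n = finishedCount.get();
        return n == 0 ? 0L : totalDurationMillis.get() / n;
    }
}
```

With this split, the executor would only call jobStarted and jobFinished (the latter from a finally block), and the gauges and the health check would read from the collector. Because finished jobs remove their own entry, nothing stays referenced just because a gauge was never read.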



> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.jpg, oldestRunningJob.jpg, 
> SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt, 
> SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt, timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs. 
> They are served from a thread-pool and should occupy that thread only for a 
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they 
> start to occupy threads from that thread-pool, thus have an influence on the 
> performance of other scheduled/quartz-jobs.
> We should have metrics (using 
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
>  that provide information about the internals of Sling Scheduler, such as the 
> average and max duration of scheduled jobs, as well as how many jobs are 
> currently running and since when the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and 
> flag {{critical}} when, e.g., the oldest job has been running for more than 
> {{60'000ms}} (the default, configurable).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
