[ https://issues.apache.org/jira/browse/AIRFLOW-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747058#comment-16747058 ]
ASF subversion and git services commented on AIRFLOW-3177:
----------------------------------------------------------
Commit 4740da13d7432c41bc091bd9e271322b29933eef in airflow's branch
refs/heads/v1-10-test from Greg Neiheisel
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4740da1 ]
[AIRFLOW-3177] Change scheduler_heartbeat from gauge to counter (#4027)
This changes the scheduler_heartbeat metric from a gauge to a counter to
better support the statsd_exporter for use with Prometheus. A counter
lets users track the rate of the heartbeat and integrates better with
the exporter. With a gauge, a crashed or stopped scheduler no longer
emits the metric, but the statsd_exporter keeps showing the last value,
1. A counter fixes this because it changes continually while the
scheduler is running, so a lack of change indicates a problem with the
scheduler.
Add statsd change notice in UPDATING.md
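
The difference described in the commit message can be illustrated with a
small simulation. This is only a sketch: ToyExporter and simulate are
hypothetical names, and the class is a toy stand-in that mimics "remember
the last state you were sent", not the statsd_exporter's actual code. The
only real-world detail it relies on is the StatsD line format
<name>:<value>|<type>, where g is a gauge and c is a counter.

```python
class ToyExporter:
    """Toy stand-in (hypothetical) for an exporter that keeps the last state it saw."""

    def __init__(self):
        self.gauges = {}
        self.counters = {}

    def receive(self, line):
        # StatsD line format: <name>:<value>|<type>, e.g. "scheduler_heartbeat:1|c"
        name, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "g":
            # A gauge overwrites the stored value.
            self.gauges[name] = float(value)
        elif kind == "c":
            # A counter adds to the running total.
            self.counters[name] = self.counters.get(name, 0.0) + float(value)
        else:
            raise ValueError("unsupported metric type: " + kind)


def simulate(kind, beats_alive, scrapes_after_crash):
    """Return what a scraper would read after each beat, then after the crash."""
    exporter = ToyExporter()
    store = exporter.gauges if kind == "g" else exporter.counters
    scraped = []
    for _ in range(beats_alive):
        # The scheduler is alive and emits one heartbeat per loop.
        exporter.receive("scheduler_heartbeat:1|" + kind)
        scraped.append(store["scheduler_heartbeat"])
    for _ in range(scrapes_after_crash):
        # The scheduler is down: nothing is emitted, but scraping continues.
        scraped.append(store["scheduler_heartbeat"])
    return scraped


if __name__ == "__main__":
    print("gauge:  ", simulate("g", 3, 2))
    print("counter:", simulate("c", 3, 2))
```

In this toy model the gauge reads 1.0 on every scrape, before and after the
crash, so the two states cannot be told apart. The counter climbs while the
scheduler is alive and then goes flat, which is the signal an alert can use.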
> Change scheduler_heartbeat metric from gauge to counter
> -------------------------------------------------------
>
> Key: AIRFLOW-3177
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3177
> Project: Apache Airflow
> Issue Type: Improvement
> Components: scheduler
> Affects Versions: 2.0.0
> Reporter: Greg Neiheisel
> Assignee: Greg Neiheisel
> Priority: Minor
> Fix For: 1.10.1
>
>
> Currently, the scheduler_heartbeat metric exposed with the statsd integration
> is a gauge. I'm proposing to change the gauge to a counter for a better
> integration with Prometheus via the
> statsd_exporter (https://github.com/prometheus/statsd_exporter).
> Rather than pointing Airflow at an actual statsd server, you can point it at
> this exporter, which will accumulate the metrics and expose them to be
> scraped by Prometheus at /metrics. The problem is that once this value is set
> when the scheduler runs its first loop, it will always be exposed to
> Prometheus as 1. The scheduler can crash or be turned off, and the
> statsd_exporter will keep reporting 1 until the exporter itself is
> restarted and rebuilds its internal state.
> By turning this metric into a counter, we can detect an issue with the
> scheduler by graphing and alerting on a rate. If the rate of change of the
> counter drops below the expected rate (determined by the
> scheduler_heartbeat_secs setting), we can fire an alert.
> This should be helpful for adoption in Kubernetes environments where
> Prometheus is pretty much the standard.
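
The rate-based alert proposed above can be sketched in a few lines. This is
a minimal illustration under stated assumptions, not an Airflow or
Prometheus API: the function names and the tolerance parameter are
hypothetical, and the expected rate assumes the scheduler emits exactly one
heartbeat every scheduler_heartbeat_secs seconds.

```python
def expected_rate(scheduler_heartbeat_secs):
    """Heartbeats per second, assuming one beat every scheduler_heartbeat_secs."""
    return 1.0 / scheduler_heartbeat_secs


def observed_rate(count_start, count_end, window_secs):
    """Per-second increase of the counter between two samples."""
    if count_end < count_start:
        # The counter went backwards, so assume it was reset to zero
        # (for example, the exporter restarted) and count from zero.
        delta = count_end
    else:
        delta = count_end - count_start
    return delta / window_secs


def heartbeat_is_stale(count_start, count_end, window_secs,
                       scheduler_heartbeat_secs, tolerance=0.5):
    """True when the observed rate falls below tolerance * expected rate.

    tolerance is a hypothetical knob that leaves headroom for slow loops.
    """
    observed = observed_rate(count_start, count_end, window_secs)
    return observed < tolerance * expected_rate(scheduler_heartbeat_secs)


if __name__ == "__main__":
    # 5-second heartbeat sampled over a 60-second window: about 12 beats expected.
    print(heartbeat_is_stale(100, 112, 60, 5))  # healthy scheduler
    print(heartbeat_is_stale(100, 100, 60, 5))  # no beats in the window
```

With a 5-second heartbeat and a 60-second window, a healthy scheduler adds
about 12 to the counter. A window with no increase, or with far fewer beats
than expected, falls below the threshold and would fire the alert.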
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)