[prometheus-users] Re: Patterns to expose absent metrics when 0 is meaningful.

Brian Candler Sun, 29 Aug 2021 01:22:41 -0700

On Thursday, 26 August 2021 at 05:14:04 UTC+1 Dorian Jaminais-Grellier 
wrote:


> I could make the metrics disappear completely from my /metrics endpoint. 
> I understand this is frowned upon, but it would have the advantage of 
> being very clear to users that the data is missing.
>
>
I'd say it's not exactly frowned upon.  It can make it more difficult to 
alert on this condition, but it's doable, either by joining to another 
timeseries that has all the labels that you expect to see (using 'and' or 
'unless'), or by joining to itself in conjunction with a time offset (e.g. 
alert when timeseries existed 10 minutes ago but doesn't exist now).

https://www.robustperception.io/absent-alerting-for-jobs
https://www.robustperception.io/using-time-series-as-alert-thresholds
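
To make the second approach concrete, here is a rough sketch in Python 
rather than PromQL.  Joining a timeseries to itself with a time offset 
amounts to a set difference: take the label sets that existed 10 minutes 
ago and drop the ones that still exist now.  In PromQL this is the kind of 
thing 'unless' combined with 'offset' expresses.  The function name and the 
"sensor" label values below are made up for illustration.

```python
def vanished_series(seen_earlier, seen_now):
    """Return label sets that existed at the earlier instant but are absent now.

    Mirrors the idea of `metric offset 10m unless metric`: keep every
    series from the left-hand side that has no match on the right.
    """
    now = set(seen_now)
    return sorted(s for s in set(seen_earlier) if s not in now)


# Hypothetical label sets, each a tuple of (label, value) pairs.
ten_minutes_ago = [
    (("sensor", "attic"),),
    (("sensor", "cellar"),),
]
current = [
    (("sensor", "attic"),),
]

# Series to alert on: present 10 minutes ago, gone now.
missing = vanished_series(ten_minutes_ago, current)
```

Note that a series which has been gone for longer than the offset no longer 
shows up in the result, so an alert built this way only fires for a limited 
window after the metric disappears.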

The traditional way to handle this is to have a separate metric 
representing whether or not temperature was collected successfully - 
comparable to "up" in regular scraping, or "probe_success" in 
blackbox_exporter.  This assumes that you are able to scrape, and the 
exporter is able to say explicitly "I could not talk to the temperature 
sensor", or "I talked to the temperature sensor, but it had no value to 
give to me".  In that case, the value 0 or 1 tells you whether there's a 
problem with temperature collection or not; the main metric can either 
vanish, or report the last-known value, whichever is more useful to you.
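
As a minimal sketch of that pattern (not taken from any real exporter): the 
function below renders the text a /metrics endpoint might return, always 
including a 0/1 success metric and letting the main metric vanish when 
there is no reading.  The metric names temperature_collect_success and 
temperature_celsius are assumptions chosen for the example.

```python
def render_metrics(reading_celsius):
    """Render exposition text for one scrape.

    reading_celsius is a float, or None if the sensor gave no value.
    Metric names here are hypothetical.
    """
    ok = reading_celsius is not None
    lines = [
        "# TYPE temperature_collect_success gauge",
        f"temperature_collect_success {1 if ok else 0}",
    ]
    if ok:
        # Only expose the main metric when there is a real value, so a
        # genuine reading of 0 is never confused with "no data".
        lines += [
            "# TYPE temperature_celsius gauge",
            f"temperature_celsius {reading_celsius}",
        ]
    return "\n".join(lines) + "\n"


healthy = render_metrics(0.0)   # a real reading of zero degrees
broken = render_metrics(None)   # sensor had no value to give
```

With this shape, alerting on temperature_collect_success == 0 covers the 
"could not read the sensor" case without needing to detect an absent series.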

However, it sounds like rather than scraping, you're using something like 
pushgateway to get the last reported value.  If so, the reporting of the 
temperature (to the pushgateway) is not synchronous with the scraping of 
the data (from the pushgateway), and the right approach depends on which 
failure modes you're trying to deal with.  If the issue is "temperature 
probe is broken, but I'm able to report that it's broken", then you can 
push a separate metric saying success/fail.  But if the reporter just goes 
offline or stops pushing data, that doesn't help you.

For that last failure mode, a separate metric with the timestamp of the 
last push is the safest approach, but as you suggest, you need to process 
this somewhat to make it more useful.  You could have a recording rule to 
synthesise a status value, i.e. one that stores 1 if the push timestamp is 
"fresh enough" and 0 if the last push is older than some threshold.
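
The logic such a rule would implement can be sketched in Python as below. 
The function name and the 600-second threshold are assumptions for 
illustration; pick whatever "fresh enough" means for your push interval. 
(The pushgateway itself exposes a push_time_seconds metric per group, which 
may be usable as the timestamp here.)

```python
import time


def push_status(last_push_timestamp, now=None, max_age_seconds=600):
    """Return 1 if the last push is fresh enough, else 0.

    last_push_timestamp and now are Unix timestamps in seconds.
    max_age_seconds is a hypothetical threshold, not a recommended value.
    """
    if now is None:
        now = time.time()
    age = now - last_push_timestamp
    return 1 if age <= max_age_seconds else 0


# Example with fixed timestamps so the result does not depend on the clock.
fresh = push_status(1000, now=1300)   # pushed 300s ago
stale = push_status(1000, now=1601)   # pushed 601s ago
```

The resulting 0/1 series then plays the same role as "up" does for regular 
scraping.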

Or you can make a pushgateway which has a TTL that expires the metric; 
that's a feature that has been requested but rejected for the standard 
pushgateway, so you may find it useful to read the relevant issue threads 
to understand why it's considered a bad idea.

https://github.com/prometheus/pushgateway/issues/19
https://github.com/prometheus/pushgateway/issues/117

I did find a fork with TTL: https://github.com/dinumathai/pushgateway

