> As a reminder, my goal was:
> - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
>   how Icinga would put the host into the down state after pings had failed
>   for a number of seconds)
> - but even if a single scrape fails (which alone wouldn't trigger the above
>   alert) I'd like to get a notification (telling me that something might be
>   fishy with the networking or so), that is, UNLESS that single failed
>   scrape is part of a sequence of failed scrapes that also caused / will
>   cause the above target-down alert
>
> Assuming in the following that each number is a sample value of the
> `up` metric of a single host, spaced ~10s apart, with the most recent
> one being the rightmost:
> - 1 1 1 1 1 1 1 => should give nothing
> - 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single
>                    failure, or develop into the target-down alert)
> - 1 1 1 1 1 0 0 => same as above, not clear yet
> ...
> - 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

One thing you can look into here for detecting and counting failed
scrapes is resets(). This works perfectly well when applied to a gauge
that is 1 or 0, and in this case it will count the number of times the
metric went from 1 to 0 in a particular time interval. You can similarly
use changes() to count the total number of transitions (either 1->0
scrape failures or 0->1 scrapes starting to succeed after failures).
It may also be useful to multiply the result of this by the current
value of the metric, so for example:

        resets(up{..}[1m]) * up{..}

will be non-zero if there have been some number of scrape failures over
the past minute *but* the most recent scrape succeeded (if that scrape
failed, you're multiplying resets() by zero and getting zero). You can
then wrap this in an '(...) > 0' to get something you can maybe use as
an alert rule for the 'scrapes failed' notification. You might need to
make the range for resets() one step larger than the one you use for the
'target-down' alert, since resets() will also be zero if up{...} was
zero all through its range.
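
Putting that together, and going by the numbers from the original
question (a ~10s scrape interval and a 1m target-down window), the
'scrapes failed' expression might look something like this sketch:

        (resets(up{..}[1m10s]) * up{..}) > 0

Here [1m10s] is just the 1m target-down range plus one ~10s step; the
exact values depend on your real scrape and evaluation intervals, so
treat them as placeholders.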

(At this point you may also want to look at the alert 'keep_firing_for'
setting.)
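
For example, a rough and entirely made-up sketch of such a rule (the
alert name, selector, and durations are placeholders, and
keep_firing_for needs a reasonably recent Prometheus, 2.42 or later I
believe):

        groups:
          - name: scrape-failures
            rules:
              - alert: ScrapeFailuresSeen
                # Placeholder selector and range; see the discussion above.
                expr: (resets(up{job="node"}[1m10s]) * up{job="node"}) > 0
                # Keep the alert firing for a while after the expression
                # stops matching, so a brief blip doesn't immediately churn
                # through fire/resolve/fire notifications.
                keep_firing_for: 5m
                annotations:
                  summary: "Some scrapes for {{ $labels.instance }} failed recently"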

However, my other suggestion here would be that this notification or
count of failed scrapes may be better handled as a dashboard or a
periodic report (from a script) instead of through an alert, especially
a fast-firing alert. I think it will be relatively difficult to make an
alert give you an accurate count of how many times this happened; if you
want such a count to make decisions, a dashboard (possibly visualizing
the up/down blips) or a report could be better. A program is also in a
position to extract the raw up{...} metrics (with timestamps) and then
readily analyze them for things like how long the failed scrapes tend to
last, how frequently they happen, etc etc.
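
For illustration only, a minimal Python sketch of that kind of report,
using the standard /api/v1/query_range HTTP API via the requests
library; the server URL, selector, step, and time window are all
assumptions to adjust.

        #!/usr/bin/env python3
        # Pull raw up{...} samples and summarize the runs of failed scrapes.
        import time
        import requests

        PROM = "http://localhost:9090"   # placeholder Prometheus address
        QUERY = 'up{job="node"}'         # placeholder selector
        STEP = 10                        # roughly the scrape interval, seconds
        end = time.time()
        start = end - 24 * 3600          # look back one day

        resp = requests.get(
            PROM + "/api/v1/query_range",
            params={"query": QUERY, "start": start, "end": end, "step": STEP},
        )
        resp.raise_for_status()

        for series in resp.json()["data"]["result"]:
            instance = series["metric"].get("instance", "?")
            outages = []                 # (start_ts, end_ts) for each run of 0s
            run_start = None
            for ts, value in series["values"]:
                if value == "0":
                    if run_start is None:
                        run_start = ts
                elif run_start is not None:
                    outages.append((run_start, ts))
                    run_start = None
            if run_start is not None:
                outages.append((run_start, series["values"][-1][0]))
            total = sum(e - s for s, e in outages)
            print("%s: %d failed-scrape stretches, %.0f seconds down in total"
                  % (instance, len(outages), total))

From there it is straightforward to also look at how the blips cluster
over time, which is exactly the sort of thing that's awkward to pull out
of an alert.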

        - cks
PS: This is not my clever set of tricks, I got it from other people.
