I usually recommend throwing out any "But this is how Icinga does it"
thinking.

The way we approach this kind of thing in Prometheus is to simply think
in terms of "availability".

For any scrape failures:

    avg_over_time(up[5m]) < 1

For more than one scrape failure (assuming 15s intervals, a 5m window
holds 20 samples, so a single failure gives an average of exactly 0.95):

    avg_over_time(up[5m]) < 0.95
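
As a rough sketch of turning that into an alerting rule (the group and
alert names, labels, and annotations here are just placeholders, pick
your own):

    groups:
      - name: availability
        rules:
          - alert: ScrapesFailing
            # More than one failed scrape in the last 5 minutes,
            # assuming a 15s scrape interval (20 samples per window).
            expr: avg_over_time(up[5m]) < 0.95
            labels:
              severity: warning
            annotations:
              summary: "Some scrapes of {{ $labels.instance }} are failing"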

This is a much easier way to think about "uptime".

Also, if you want, there is the new "keep_firing_for" alerting option.
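
For example (again just a sketch, and it needs a reasonably recent
Prometheus, 2.42 or later if I remember right; the 15m value is
arbitrary), it keeps an alert firing for a while after its expression
stops matching, which damps the 1 0 1 0 flapping:

          - alert: ScrapesFailing
            expr: avg_over_time(up[5m]) < 0.95
            # Keep the alert in the firing state for 15 minutes after
            # the expression stops matching, instead of resolving it
            # immediately.
            keep_firing_for: 15m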

On Mon, Mar 18, 2024 at 5:45 AM Christoph Anton Mitterer <cales...@gmail.com>
wrote:

> Hey Chris.
>
> On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
> >
> > One thing you can look into here for detecting and counting failed
> > scrapes is resets(). This works perfectly well when applied to a
> > gauge
>
> Though it is documented as only to be used with counters... :-/
>
>
> > that is 1 or 0, and in this case it will count the number of times
> > the
> > metric went from 1 to 0 in a particular time interval. You can
> > similarly
> > use changes() to count the total number of transitions (either 1->0
> > scrape failures or 0->1 scrapes starting to succeed after failures).
>
> The idea sounds promising... especially for also catching cases like
> 8a, which I mentioned in my previous mail, and where the
> {min,max}_over_time approach seems to fail.
>
>
> > It may also be useful to multiply the result of this by the current
> > value of the metric, so for example:
> >
> >       resets(up{..}[1m]) * up{..}
> >
> > will be non-zero if there have been some number of scrape failures
> > over
> > the past minute *but* the most recent scrape succeeded (if that
> > scrape
> > failed, you're multiplying resets() by zero and getting zero). You
> > can
> > then wrap this in an '(...) > 0' to get something you can maybe use
> > as
> > an alert rule for the 'scrapes failed' notification. You might need
> > to
> > make the range for resets() one step larger than you use for the
> > 'target-down' alert, since resets() will also be zero if up{...} was
> > zero all through its range.
> >
> > (At this point you may also want to look at the alert
> > 'keep_firing_for'
> > setting.)
>
> I will give that some more thought and reply back if I find some way
> to make an alert out of this.
>
> Well, and probably also if I fail to ^^ ... at least at first glance I
> wasn't able to use that to create an alert that would behave as
> desired. :/
>
>
> > However, my other suggestion here would be that this notification or
> > count of failed scrapes may be better handled as a dashboard or a
> > periodic report (from a script) instead of through an alert,
> > especially
> > a fast-firing alert.
>
> Well, the problem with a dashboard would IMO be that someone must
> actually look at it, or otherwise it would be pointless. ;-)
>
> I'm not really sure how to do that with a script (which I guess would
> be conceptually similar to an alert... just that it's sent e.g. weekly).
>
> I guess I'm not so much interested in the exact times when single
> scrapes fail (I cannot correct it retrospectively anyway), but just in
> *that* it happens and that I have to look into it.
>
> My assumption kinda is that normally scrapes aren't lost, so I would
> really only get an alert mail if something's wrong.
> And even if the alert is flaky, like 1 0 1 0 1 0, I think the mail
> volume could still be reduced at the Alertmanager level?
>
>
> > I think it will be relatively difficult to make an
> > alert give you an accurate count of how many times this happened; if
> > you
> > want such a count to make decisions, a dashboard (possibly
> > visualizing
> > the up/down blips) or a report could be better. A program is also in
> > the
> > position to extract the raw up{...} metrics (with timestamps) and
> > then
> > readily analyze them for things like how long the failed scrapes tend
> > to
> > last for, how frequently they happen, etc etc.
>
> Well, that sounds like quite some effort... and I already think that my
> current approaches required far too much effort (and still don't
> fully work ^^).
> As said... despite not really being comparable to Prometheus: in
> Icinga a failed sensor probe would be immediately noticeable.
>
>
> Thanks,
> Chris.
>
