That's tricky to get exactly right. You could try something like this 
(untested):

    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    for: 5m

- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard' 
failure alert should have triggered)

Therefore, this should alert if any scrape failed in the last 5 minutes, 
unless all scrapes in that window failed.
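
For completeness, here's a rough sketch of how that could sit in a rule 
group, reusing the group and alert names from your existing single-scrape 
rule (untested, adjust to taste):

    groups:
      - name:     alerts_general_single-scrapes
        interval: 15s
        rules:
        - alert: general_target-down_single-scrapes
          # Fires if at least one scrape failed in the last 5m, unless
          # every scrape in that window failed (the case the standard
          # general_target-down alert is meant to cover).
          expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
          for:  5m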

There is a boundary condition: if scraping fails for approximately 5 
minutes, you can't be sure whether the standard failure alert would have 
triggered, so this may need a bit of tweaking for robustness. To start 
with, just widen the window to 6 minutes:

    expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
    for: 6m

That is, if max_over_time(up[6m]) is zero, we can be fairly sure that a 
standard alert will have been triggered by then.
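
If you want to sanity-check the window behaviour, you can evaluate the two 
halves separately in the expression browser, e.g. (the instance and job 
labels here are just placeholders):

    # Drops to 0 as soon as one scrape in the window failed:
    min_over_time(up{job="node", instance="some.host:9100"}[6m])
    # Only drops to 0 once every scrape in the window failed:
    max_over_time(up{job="node", instance="some.host:9100"}[6m])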

I'm still not quite convinced about the "for: 6m" and whether we might lose 
an alert if there were a single failed scrape. Maybe this would be more 
sensitive:

    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
    for: 7m

but I think you might get some spurious alerts at the *end* of a period of 
downtime.
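
If that end-of-downtime noise turns out to be a problem in practice, one 
option (again untested) would be to combine this with the ALERTS-based 
suppression you're already using, so the windowed alert is also held back 
while the standard alert is firing for the same instance and job:

    expr: >
      min_over_time(up[8m]) == 0
        unless max_over_time(up[6m]) == 0
        unless on (instance, job)
          (ALERTS{alertname="general_target-down", alertstate="firing"} == 1)
    for: 7m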

On Tuesday, 9 May 2023 at 02:29:40 UTC+1 Christoph Anton Mitterer wrote:

> Hey.
>
> I have an alert rule like this:
>
> groups:
>   - name:       alerts_general
>     rules:
>     - alert: general_target-down
>       expr: 'up == 0'
>       for:  5m
>
> which is intended to notify about a target instance (or rather, a specific 
> exporter on it) being down.
>
> There are also routes in alertmanager.yml which have some longer periods 
> for group_wait and group_interval and also distribute the resulting alerts 
> to the various receivers (e.g. depending on the instance that is affected).
>
>
> By chance I've noticed that some of our instances (or the networking) seem 
> to be a bit unstable, and every now and then a single scrape, or a few of 
> them, fail.
>
> Since this typically does not mean that the exporter is down (in the above 
> sense), I wouldn't want it to cause a notification to be sent to the people 
> responsible for the respective instances.
> But I would want one to be sent, even if only a single scrape fails, to 
> the local Prometheus admin (me ^^), so that I can look further into what 
> causes the scrape failures.
>
>
>
> My (working) solution for that is:
> a) another alert rule like:
> groups:
>   - name:     alerts_general_single-scrapes
>     interval: 15s
>     rules:
>     - alert: general_target-down_single-scrapes      
>       expr: 
> 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
>       for:  0s
>
> (With 15s being the smallest scrape interval used by any job.)
>
> And a corresponding alertmanager route like:
>   - match:
>       alertname: general_target-down_single-scrapes
>     receiver:       admins_monitoring_no-resolved
>     group_by:       [alertname]
>     group_wait:     0s
>     group_interval: 1s
>
>
> The group_wait: 0s and group_interval: 1s seemed necessary, because despite 
> the for: 0s, it seems that Alertmanager kind of checks again before 
> actually sending a notification... and when the alert is gone by then 
> (because there was e.g. only one single missing scrape) it wouldn't send 
> anything (despite the alert actually having fired).
>
>
> That works so far... that is, admins_monitoring_no-resolved gets a 
> notification for every single failed scrape, while all others only get one 
> when scrapes fail for at least 5m.
>
> I even improved the above a bit by clearing the alert for single failed 
> scrapes when the one for long-term downtime starts firing, via something 
> like:
>       expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} 
> == 0 )  unless on (instance,job)  ( ALERTS{alertname="general_target-down", 
> alertstate="firing"} == 1 )'
>
>
> I wondered whether this can be done better?
>
> Ideally I'd like a notification for general_target-down_single-scrapes to 
> be sent only if there won't be one for general_target-down.
>
> That is, I don't care if the notification comes in late (by the above 
> ~5m); it just *needs* to come, unless - of course - the target is "really" 
> down (that is, when general_target-down fires), in which case no 
> notification should go out for general_target-down_single-scrapes.
>
>
> I couldn't think of an easy way to get that. Any ideas?
>
>
> Thanks,
> Chris.
>
