Hey.

I have an alert rule like this:

groups:
  - name:       alerts_general
    rules:
    - alert: general_target-down
      expr: 'up == 0'
      for:  5m

which is intended to notify when a target instance (or rather a specific 
exporter on it) is down.

There are also routes in alertmanager.yml which use longer group_wait and 
group_interval periods and distribute the resulting alerts to the various 
receivers (e.g. depending on the instance that is affected).
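Roughly along these lines, just to give an idea (the receiver names, the 
matcher and the exact timings below are only placeholders, not my real 
config):

route:
  receiver: admins_monitoring            # placeholder fallback receiver
  group_by: [alertname, instance]
  group_wait: 5m                         # longer than the defaults
  group_interval: 30m
  routes:
    - match_re:
        instance: '^.*\.some-department\.example\.org$'   # placeholder pattern
      receiver: department_admins                          # placeholder receiver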


By chance I've noticed that some of our instances (or the networking) seem 
to be a bit unstable, and every now and then a single scrape, or a few of 
them, fail.

Since this typically does not mean that the exporter is down (in the above 
sense), I wouldn't want it to cause a notification to be sent to the people 
responsible for the respective instances.
But I would want one to be sent, even if only a single scrape fails, to the 
local Prometheus admin (me ^^), so that I can investigate what causes the 
scrape failures.



My (working) solution for that is:
a) another alert rule like:
groups:
  - name:     alerts_general_single-scrapes
    interval: 15s
    rules:
    - alert: general_target-down_single-scrapes
      expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
      for:  0s

(With 15s being the smallest scrape interval used by any job.)
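For context, that corresponds to the fastest scrape job, roughly like the 
following (job name and target are placeholders here, only the 15s 
scrape_interval matters):

scrape_configs:
  - job_name: node                       # placeholder job name
    scrape_interval: 15s                 # smallest scrape interval in use
    static_configs:
      - targets: ['somehost.example.org:9100']   # placeholder target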

And a corresponding alertmanager route like:
  - match:
      alertname: general_target-down_single-scrapes
    receiver:       admins_monitoring_no-resolved
    group_by:       [alertname]
    group_wait:     0s
    group_interval: 1s


The group_wait: 0s and group_interval: 1s seemed necessary because, despite 
the for: 0s, Alertmanager apparently checks again before actually sending a 
notification... and when the alert is already gone by then (because there 
was e.g. only one single missing scrape), it wouldn't send anything (even 
though the alert actually fired).


That works so far... that is, admins_monitoring_no-resolved gets a 
notification for every single failed scrape, while all the others only get 
one when scrapes fail for at least 5m.
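(For completeness: admins_monitoring_no-resolved is simply a receiver that 
never sends resolved notifications, roughly like the following - assuming 
e-mail, with a placeholder address:)

receivers:
  - name: admins_monitoring_no-resolved
    email_configs:
      - to: 'monitoring-admins@example.org'   # placeholder address
        send_resolved: false                  # only notify on firing, never on resolved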

I even improved the above a bit, by clearing the alert for single failed 
scrapes when the one for long-term down starts firing, via something like:
      expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )
             unless on (instance,job)
             ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'


I wondered whether this can be done better?

Ideally, I'd like a notification for general_target-down_single-scrapes to 
be sent only if there won't be one for general_target-down.

That is, I don't care if the notification comes in late (by the above ~5m), 
it just *needs* to come - unless, of course, the target is "really" down 
(i.e. when general_target-down fires), in which case no notification should 
go out for general_target-down_single-scrapes.
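To illustrate the suppression half of that: it's basically what an inhibit 
rule would give me, e.g. something along these lines (just a sketch):

inhibit_rules:
  - source_match:
      alertname: general_target-down
    target_match:
      alertname: general_target-down_single-scrapes
    equal: ['instance', 'job']

But that alone doesn't seem to guarantee that the single-scrape notification 
still goes out (possibly late) when the target recovers before 
general_target-down ever fires.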


I couldn't think of an easy way to get that. Any ideas?


Thanks,
Chris.
