Hey.

I have an alert rule like this:
groups:
  - name: alerts_general
    rules:
      - alert: general_target-down
        expr: 'up == 0'
        for: 5m

which is intended to notify when a target instance (or rather, a specific exporter on it) is down.

There are also routes in alertmanager.yml which use some "higher" periods for group_wait and group_interval and distribute the resulting alerts to the various receivers (e.g. depending on the instance that is affected); a rough sketch is in the PS at the end.

By chance I've noticed that some of our instances (or the networking) seem to be a bit unstable, and every now and then a single scrape, or a few of them, fails. Since this typically does not mean that the exporter is down (in the above sense), I wouldn't want that to cause a notification to the people responsible for the respective instances. But I would want one to be sent, even if only a single scrape fails, to the local Prometheus admin (me ^^), so that I can investigate what causes the scrape failures.

My (working) solution for that is:

a) another alert rule like:

groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
      - alert: general_target-down_single-scrapes
        expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
        for: 0s

(with 15s being the smallest scrape interval used by any of the jobs)

b) a corresponding Alertmanager route like:

- match:
    alertname: general_target-down_single-scrapes
  receiver: admins_monitoring_no-resolved
  group_by: [alertname]
  group_wait: 0s
  group_interval: 1s

The group_wait: 0s and group_interval: 1s seemed necessary because, despite the for: 0s, Alertmanager apparently checks again before actually sending a notification... and if the alert is already resolved by then (because e.g. only one single scrape was missed), it doesn't send anything, even though the alert actually fired.

That works so far: admins_monitoring_no-resolved (see the PPS) gets a notification for every single failed scrape, while everyone else only gets one when a target has been down for at least 5m.

I even improved the above a bit by clearing the alert for single failed scrapes once the long-term-down alert starts firing, via something like:

expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )
      unless on (instance,job)
      ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'

I wondered whether this can be done better? Ideally, I'd like a notification for general_target-down_single-scrapes to be sent only if there won't be one for general_target-down. That is, I don't care if the notification comes in late (by the above ~5m); it just *needs* to come, unless, of course, the target is "really" down (that is, when general_target-down fires), in which case no notification should go out for general_target-down_single-scrapes.

I couldn't think of an easy way to get that. Any ideas?

Thanks,
Chris.
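PS: In case it helps to picture the setup, the per-instance routing I mentioned at the beginning looks roughly like this (the receiver names, hosts, and the exact periods are made up for the example):

route:
  receiver: admins_monitoring
  routes:
    # routes are matched top-down; the first matching route wins
    # (no continue: true used here)
    - match:
        alertname: general_target-down_single-scrapes
      receiver: admins_monitoring_no-resolved
      group_by: [alertname]
      group_wait: 0s
      group_interval: 1s
    # general_target-down gets the "higher" periods and is fanned out
    # to the people responsible for the respective instances
    - match:
        alertname: general_target-down
      group_by: [instance]
      group_wait: 5m
      group_interval: 30m
      routes:
        - match:
            instance: 'somehost.example.org:9100'
          receiver: admins_team-a
        - match:
            instance: 'otherhost.example.org:9100'
          receiver: admins_team-b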
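PPS: And for completeness, admins_monitoring_no-resolved is (as the name suggests) just a receiver that never sends "resolved" notifications, along the lines of:

receivers:
  - name: admins_monitoring_no-resolved
    email_configs:
      - to: 'monitoring-admins@example.org'  # made-up address
        send_resolved: false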