Hi all.
I'm chasing a problem with inhibition rules and I'm not sure what I'm doing
wrong.
Basically, I'd like to snooze alerting during deployments or maintenance,
because alerts make no sense when the services are down on purpose. Despite
that, alert notifications keep popping up in Slack.
To inhibit alerts during deploys I've defined the following section in
the Alertmanager configuration:
inhibit_rules:
  - source_match_re:
      alertname: deployment_in_progress|maintenance_in_progress
    target_match_re:
      severity: warning|average|high|disaster
    equal: ['stack', 'environment']
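To spell out how I read that rule, here is a minimal Python sketch of my understanding: a target alert is muted when some firing alert matches the source matchers, the target matches the target matchers, and both carry the same value for every label listed in equal. The label values ("web", "prod", "staging") are made up for illustration, and this only models the semantics as I understand them; it is not Alertmanager's actual code.

```python
import re

# The inhibit rule from my config, expressed as plain Python data.
RULE = {
    "source_match_re": {"alertname": r"deployment_in_progress|maintenance_in_progress"},
    "target_match_re": {"severity": r"warning|average|high|disaster"},
    "equal": ["stack", "environment"],
}


def matches(labels, matchers):
    """True if every regex fully matches the corresponding label value."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in matchers.items()
    )


def is_inhibited(target, firing_alerts, rule=RULE):
    """True if some firing source alert mutes the target under the rule."""
    if not matches(target, rule["target_match_re"]):
        return False
    for source in firing_alerts:
        if not matches(source, rule["source_match_re"]):
            continue
        # Every label in `equal` must have the same value on both sides.
        # A label missing on a side is treated as the empty string here.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


# Hypothetical label sets, for illustration only.
deploy = {"alertname": "deployment_in_progress", "severity": "note",
          "stack": "web", "environment": "prod"}
down_same = {"alertname": "target_node_with_source_down", "severity": "average",
             "stack": "web", "environment": "prod"}
down_other_env = dict(down_same, environment="staging")
down_no_labels = {"alertname": "target_node_with_source_down", "severity": "average"}

for name, alert in [("same labels", down_same),
                    ("other environment", down_other_env),
                    ("no stack/environment", down_no_labels)]:
    print(name, is_inhibited(alert, [deploy]))
```

In this model only the first target alert is muted: the one with a different environment and the one without stack and environment labels both get through, and the deploy alert itself (severity "note") is never a target.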
When a deploy starts, a metric is pushed via the Pushgateway and one of
the alerts above fires. Let's consider the first one, which looks like
this:
- alert: deployment_in_progress
  expr: time() - last_deployment{status="started"} < 300
  labels:
    severity: note
  annotations:
In short, the deploy alert should last for 5 minutes. The metric is pushed
by several services as the deploy progresses, so we can have several alerts
firing with staggered start times. Their severity is "note", so those alerts
are never inhibited themselves. A note message is also delivered to Slack,
with the expected "stack" and "environment" values.
So far, so good. Assuming everything is fine there, the problem shows up
in Slack: despite the inhibition, target-down notifications are still
delivered. This morning I had the following in Slack:
<7:17> note deploy firing
<7:17> note deploy firing
<7:22> note deploy firing
<7:22> note deploy firing
<7:23> compound notification for several target down firing <--- this is
incomplete, last alert is cut in half
<7:27> note deploy resolve
<7:27> note deploy resolve
<7:28> compound notification for several target down resolve <--- this is
incomplete, last alert is cut in half
<7:43> note deploy firing
<7:44> compound notification for several target down firing
<7:48> note deploy firing
<7:49> compound notification for several target down firing (and resolves
from before)
<7:53> note deploy resolve
<7:54> compound notification for several target down resolve
I have configured:
group_by: [severity, stack, environment]
group_wait: 30s
group_interval: 5m
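As far as I understand grouping, alerts that share the same values for all group_by labels are batched into one notification. A minimal Python sketch of that idea follows; the label values ("web", "prod") and the two target-down alerts are hypothetical, and the function only models the grouping key, not Alertmanager's timers.

```python
GROUP_BY = ["severity", "stack", "environment"]


def group_key(labels):
    """Tuple of the group_by label values; a missing label counts as empty."""
    return tuple(labels.get(name, "") for name in GROUP_BY)


# Hypothetical alerts, for illustration only.
alerts = [
    {"alertname": "deployment_in_progress", "severity": "note",
     "stack": "web", "environment": "prod"},
    {"alertname": "target_node_with_source_down", "severity": "average",
     "stack": "web", "environment": "prod", "instance": "node-1"},
    {"alertname": "target_node_with_source_down", "severity": "average",
     "stack": "web", "environment": "prod", "instance": "node-2"},
]

groups = {}
for alert in alerts:
    groups.setdefault(group_key(alert), []).append(alert["alertname"])

for key, names in groups.items():
    print(key, names)
```

Because severity is part of group_by, the "note" deploy alert and the "average" target-down alerts end up in separate groups, which would match the separate note messages and compound target-down messages I see in Slack.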
Also, the target-down rule is as follows:
alert: target_node_with_source_down
expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
labels:
  severity: average
Is this just a timing issue, i.e. the source alert reaches Prometheus too
late to be taken into account and prevent the other alerts from triggering,
or could there be something else? Thinking about it, could it be the effect
of avg_over_time spreading the down-ness over time?
This afternoon I had a deploy notification at 14:34 and a grouped set of
alerts at 14:38, which is indeed within the 5m span. But I guess the
average plays a role here. Am I wrong?
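To test my hunch about the average, here is a small Python sketch of the two expressions on a shared timeline. Everything numeric apart from the 300s deploy window, the 5m range and the 0.9 threshold is an assumption: a 15s scrape and evaluation interval, a metric pushed at t=0, and a target that is down from t=60s to t=150s. The 5m range is approximated as the last 20 samples.

```python
SCRAPE = 15                    # seconds between scrapes/evaluations (assumed)
WINDOW = 300                   # the [5m] range in the target-down rule
DEPLOY_PUSHED = 0              # time last_deployment is pushed (assumed)
DOWN_FROM, DOWN_TO = 60, 150   # hypothetical outage during the deploy


def up(t):
    """Scraped value of `up` at time t: 0 while the target is down."""
    return 0 if DOWN_FROM <= t < DOWN_TO else 1


def avg_over_time_up(t):
    """Average of the samples in the 5m window ending at t."""
    samples = [up(t - k * SCRAPE) for k in range(WINDOW // SCRAPE)]
    return sum(samples) / len(samples)


def deploy_alert(t):
    """time() - last_deployment < 300, once the metric has been pushed."""
    return t >= DEPLOY_PUSHED and t - DEPLOY_PUSHED < 300


def down_alert(t):
    """avg_over_time(up[5m]) < 0.9"""
    return avg_over_time_up(t) < 0.9


times = range(0, 600, SCRAPE)
down_times = [t for t in times if down_alert(t)]
# Moments when the target-down alert is active but no deploy alert covers it.
uncovered = [t for t in down_times if not deploy_alert(t)]

print("down alert active from", down_times[0], "to", down_times[-1])
print("active without a deploy alert at", uncovered)
```

Under these assumptions the target-down expression stays true well after the target is back up, because the zero samples need almost the full 5 minutes to leave the window, while the deploy alert stops exactly 300s after the push. If that model is right, there is a stretch where the target alert is still firing with nothing left to inhibit it.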
Any help is much appreciated. Thanks in advance.