Hi all.

I'm chasing a problem with inhibition rules and I'm not sure what I'm doing 
wrong. 
Basically, I'd like to suppress alerting during deployments or maintenance, 
because alerts make no sense when the services are down on purpose. Despite 
the inhibition rule, alert notifications keep popping up in Slack. 

To inhibit alerts during deploys, I've defined the following section in the 
Alertmanager configuration:

inhibit_rules:
- source_match_re:
    alertname: deployment_in_progress|maintenance_in_progress
  target_match_re:
    severity: warning|average|high|disaster
  equal: ['stack', 'environment']
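
For reference, this is how I understand the rule to be evaluated, written 
as a small Python sketch. It is my mental model, not Alertmanager code: the 
function names and the label values "web" and "prod" are made up for 
illustration. It assumes regex matchers are fully anchored and that a 
missing label compares as an empty string.

```python
import re

# Mirrors the inhibit rule above.
RULE = {
    "source_match_re": {"alertname": "deployment_in_progress|maintenance_in_progress"},
    "target_match_re": {"severity": "warning|average|high|disaster"},
    "equal": ["stack", "environment"],
}


def matches(labels, matchers):
    """True if every matcher regex fully matches the label value."""
    return all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items())


def is_inhibited(target, firing_alerts, rule=RULE):
    """True if some firing source alert mutes the target alert."""
    if not matches(target, rule["target_match_re"]):
        return False
    for source in firing_alerts:
        if not matches(source, rule["source_match_re"]):
            continue
        # Every label listed in `equal` must have the same value on both sides.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


# Hypothetical label sets, for illustration only.
deploy = {"alertname": "deployment_in_progress", "severity": "note",
          "stack": "web", "environment": "prod"}
down = {"alertname": "target_node_with_source_down", "severity": "average",
        "stack": "web", "environment": "prod"}
down_no_stack = {k: v for k, v in down.items() if k != "stack"}
```

In this model is_inhibited(down, [deploy]) is True, while 
is_inhibited(down_no_stack, [deploy]) is False, so a target alert that 
lacks the "stack" or "environment" label would never be muted.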


When a deploy starts, a metric is pushed via the Pushgateway and one of 
the alerts above fires. Let's consider the first one, which looks like 
this:

- alert: deployment_in_progress
  expr: time() - last_deployment{status="started"} < 300
  labels:
    severity: note
  annotations:


In short, each deploy alert should last for 5 minutes. The metric is pushed 
by several services as the deploy proceeds, so we can have several deploy 
alerts firing with staggered start times. Their severity is "note", so 
those alerts are never inhibited themselves. A note message is also 
delivered to Slack, with the expected "stack" and "environment" values.

So far, so good. Assuming everything is fine there, the problem shows up 
in Slack: despite the inhibition, notifications about targets being down 
are still delivered. This morning I had the following in Slack:

<7:17> note deploy firing
<7:17> note deploy firing
<7:22> note deploy firing
<7:22> note deploy firing
<7:23> compound notification for several target down firing <--- this is 
incomplete, last alert is cut in half
<7:27> note deploy resolve
<7:27> note deploy resolve
<7:28> compound notification for several target down resolve <--- this is 
incomplete, last alert is cut in half
<7:43> note deploy firing
<7:44> compound notification for several target down firing
<7:48> note deploy firing
<7:49> compound notification for several target down firing (and resolves 
from before)
<7:53> note deploy resolve
<7:54> compound notification for several target down resolve

I have configured:

group_by: [severity, stack, environment]
group_wait: 30s
group_interval: 5m


Also, the up-ness rule is as follows:

- alert: target_node_with_source_down
  expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
  labels:
    severity: average


Is this just a timing issue, i.e. the source alert reaches Prometheus too 
late to be taken into account and prevent the other alerts from 
triggering, or could there be something else? Thinking about it, could it 
be an effect of avg_over_time, which spreads the down-ness over time? 

This afternoon I had a deploy notification at 14:34 and a grouped set of 
alerts at 14:38, which is indeed within the 5m span. But I guess the avg 
plays a role here. Am I wrong?
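
To make the suspicion concrete, here is a rough Python simulation of the 
two expressions. It is a toy model, not Prometheus: it assumes a 15s 
scrape and evaluation interval (not taken from my config), a single push 
of last_deployment at deploy start, and a target that was up before the 
deploy. It ignores group_wait and group_interval entirely.

```python
SCRAPE_INTERVAL = 15    # seconds; assumed value
WINDOW = 300            # the [5m] range in avg_over_time
DEPLOY_ALERT_TTL = 300  # from: time() - last_deployment < 300


def up_at(t, down_from, down_until):
    """Value of `up` for a scrape at time t: 0 while the target is down."""
    return 0 if down_from <= t < down_until else 1


def avg_over_window(t, down_from, down_until):
    """Mean of the 20 samples in the 5m window ending at t."""
    times = range(t - WINDOW + SCRAPE_INTERVAL, t + 1, SCRAPE_INTERVAL)
    samples = [up_at(s, down_from, down_until) for s in times]
    return sum(samples) / len(samples)


def firing_times(deploy_start, down_from, down_until, horizon=900):
    """Evaluation times at which each alert expression is true."""
    deploy, target = [], []
    for t in range(0, horizon + 1, SCRAPE_INTERVAL):
        if 0 <= t - deploy_start < DEPLOY_ALERT_TTL:
            deploy.append(t)
        if avg_over_window(t, down_from, down_until) < 0.9:
            target.append(t)
    return deploy, target


# Deploy pushed at t=0, target down for the first 120 seconds.
deploy, target = firing_times(deploy_start=0, down_from=0, down_until=120)
uncovered = [t for t in target if t not in deploy]
```

With these numbers the deploy expression is true up to t=285, while the 
target expression is true from t=30 to t=360, so `uncovered` holds the 
evaluations from t=300 to t=360 where nothing would inhibit the target 
alert. If this model is anywhere near right, the 5m average keeps the 
target alert alive after the 300s deploy window has closed.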

Any help much appreciated. Thanks in advance. 
