Hello.
I figured out that the notifications were not delivered to Slack in
sync with the snoozing window because of the way the inhibition metric
was pushed and how grouping was set up. Revisiting the grouping
configuration aligned the Slack notifications with the actual snoozing
window and fixed the issue.
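For anyone hitting the same symptom: the fix was on the routing/grouping
side. This is a sketch of the kind of adjustment that helped (exact
values are illustrative; label names match the config quoted below):

```yaml
route:
  # Group on the same labels used in the inhibition rule's `equal`
  # list, so an alert muted by the deploy/maintenance source is
  # dropped from its group before the group is flushed to Slack.
  group_by: [stack, environment]
  group_wait: 30s
  # Re-evaluate groups well within the 5-minute snoozing window,
  # instead of the previous 5m interval.
  group_interval: 1m
```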
Thanks to anyone who looked into this issue.
Best,
F.
On Thursday, 5 March 2020 15:42:13 UTC+1, Federico Buti wrote:
>
> Hi all.
>
> I'm chasing an inhibition-rules problem and I'm not sure what I'm
> doing wrong. Basically, I'd like to snooze alerting during deployments
> or maintenance, since it doesn't make sense to receive alerts when the
> services are purposely down. Despite that, alert notifications keep
> popping up in Slack.
>
> For the purpose of inhibiting alerts during deploys, I've defined the
> following section in the Alertmanager configuration:
>
> inhibit_rules:
>   - source_match_re:
>       alertname: deployment_in_progress|maintenance_in_progress
>     target_match_re:
>       severity: warning|average|high|disaster
>     equal: ['stack', 'environment']
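>
> To make `equal` concrete: this rule mutes a target alert only when
> both listed labels coincide with the firing source. For example (the
> label values here are made up):
>
> ```yaml
> # Source (fires during a deploy):
> #   {alertname="deployment_in_progress", stack="billing", environment="prod"}
> # Target (muted while the source fires):
> #   {alertname="target_node_with_source_down", severity="average",
> #    stack="billing", environment="prod"}
> # A target with environment="staging" would NOT be muted: its
> # 'environment' differs from the source's.
> ```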
>
>
> When a deploy is started, a metric is pushed via the Pushgateway and
> one of the alerts above fires. Let's consider the first one, which
> looks like this:
>
> - alert: deployment_in_progress
>   expr: time() - last_deployment{status="started"} < 300
>   labels:
>     severity: note
>   annotations:
>
> In short, the deploy alert should last for five minutes. The metric is
> pushed by several services as the deploy progresses, so we can have
> several alerts ongoing at staggered times. Their severity is "note", so
> they are never inhibited themselves. A note message is also delivered
> to Slack with the desired "stack" and "environment" values.
>
> So far, so good. Assuming everything is fine there, the problem shows
> up in Slack: despite the inhibition, notifications about targets being
> down are still delivered. This morning I had the following:
>
> <7:17> note deploy firing
> <7:17> note deploy firing
> <7:22> note deploy firing
> <7:22> note deploy firing
> <7:23> compound notification for several target down firing <--- incomplete, last alert is cut in half
> <7:27> note deploy resolve
> <7:27> note deploy resolve
> <7:28> compound notification for several target down resolve <--- incomplete, last alert is cut in half
> <7:43> note deploy firing
> <7:44> compound notification for several target down firing
> <7:48> note deploy firing
> <7:49> compound notification for several target down firing (and resolves from before)
> <7:53> note deploy resolve
> <7:54> compound notification for several target down resolve
>
> I have configured:
>
> group_by: [severity, stack, environment]
> group_wait: 30s
> group_interval: 5m
>
>
> Also, the upness rule is as follows:
>
> - alert: target_node_with_source_down
>   expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
>   labels:
>     severity: average
>
>
> Is this just a timing issue, i.e. the source alert reaches Prometheus
> too late to be taken into account before the target alerts trigger, or
> could there be something else? Thinking about it, could this be an
> effect of avg_over_time spreading the down-ness over time?
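>
> For comparison, an instant expression would track the target's actual
> state, while the 5m average lags on both edges, so the averaged alert
> can start firing (or keep firing) outside the deploy alert's 5-minute
> window. A sketch (the alert name `target_down_instant` is made up):
>
> ```yaml
> # Fires only while the target is actually down:
> - alert: target_down_instant
>   expr: up{job="node",source=~".+"} == 0
>   labels:
>     severity: average
>
> # The averaged form starts late and keeps firing for a while after
> # the target comes back:
> - alert: target_node_with_source_down
>   expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
>   labels:
>     severity: average
> ```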
>
> This afternoon I had a deploy notification at 14:34 and a grouped set
> of alerts at 14:38, which is indeed within the 5m span. But I guess the
> avg plays a role here. Am I wrong?
>
> Any help much appreciated. Thanks in advance.
>
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/c4a3baaf-fb11-4318-a219-bd5fdc4074e9o%40googlegroups.com.