Hi!
I think this is mostly a configuration issue, so I'm posting here before
opening a GitHub issue to see if someone can help me.
I have a Prometheus server configured in Kubernetes with two Alertmanagers in
HA (1 Prometheus server and 2 Alertmanagers).
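For context, the Prometheus side simply points at both Alertmanager pods,
roughly like this (the target addresses below are placeholders for our real
service names, not the exact config):
================================================
# prometheus.yml (sketch)
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager-0.alertmanager:9093
      - alertmanager-1.alertmanager:9093
================================================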
Alertmanager Configuration:
================================================
# Deployment relevant bits
prometheus-alertmanager:
  Image:      prom/alertmanager:v0.19.0
  Port:       9093/TCP
  Host Port:  0/TCP
  Args:
    --config.file=/etc/config/alertmanager.yml
    --storage.path=/data
    --log.level=debug
    --cluster.settle-timeout=2m
    --cluster.listen-address=0.0.0.0:19604
================================================
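(For completeness: the two instances join each other through the standard
--cluster.peer flags; since the deployment above only shows the relevant
bits, the addresses below are placeholders for the real pod/service names.)
--cluster.peer=alertmanager-0.alertmanager:19604
--cluster.peer=alertmanager-1.alertmanager:19604
================================================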
# Configmap relevant bits
receivers:
(...)
route:
  group_wait: 120s
  group_interval: 5m
  receiver: default-receiver
  repeat_interval: 168h
  group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname',
             'objectid', 'alertid', 'resourceid']
  routes:
  - match:
      severity: blackhole
    receiver: blackhole
    continue: false
  - match:
      tag: "source_tag"
    receiver: blackhole
    repeat_interval: 1m
    group_interval: 1m
    continue: false
(...)
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
- source_match:
    tag: "source_tag"
  target_match:
    tag: "target_tag"
Inhibition rules work like a charm, until one of the Alertmanagers dies. If
a node of the cluster dies, one of the Alertmanager pods has to be relocated
and restarts. When it restarts, we can see in the log file that the alert
carrying tag: 'target_tag' is received before the source one, and a
notification is fired.
*Example:*
We have an alert in Prometheus that fires between 10 and 12 AM. While this
alert is firing, I want all the alerts that match a given label (in this case
tag: target_tag) to be inhibited. This works flawlessly unless the
Alertmanager is restarted; then I can see in the logs:
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:217 msg="Starting Alertmanager" version="(version=0.19.0, branch=HEAD, revision=7aa5d19fea3f58e3d27dbdeb0f2883037168914a)"
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:218 build_context="(go=go1.12.8, user=root@587d0268f963, date=20190903-15:01:40)"
level=debug ts=2020-03-31T14:22:58.506Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=<peers>
(...)
level=debug ts=2020-03-31T14:22:58.702Z caller=cluster.go:306 component=cluster memberlist="2020/03/31 14:22:58 [DEBUG] memberlist: Initiating push/pull sync with: <peer IP>\n"
level=debug ts=2020-03-31T14:22:58.704Z caller=delegate.go:230 component=cluster received=NotifyJoin (...) addr=<peer IP>"
level=debug ts=2020-03-31T14:22:58.802Z caller=cluster.go:470 component=cluster msg="peer rejoined" (...)"
level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This is the alert i want to inhibit", tag="target_tag" "}\" receiver:<group_name:\"default-receiver\" (...)> timestamp:<seconds:1585648804 nanos:750301 > firing_alerts:3876410699172976497 > expires_at:<seconds:1586080804 nanos:750301 > "
level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This is the alert that fires between 10 and 12AM", tag="source_tag" "}\" receiver:<group_name:\"blackhole\" (...)> "
The alert that should be inhibited is received from the peer before the one
that inhibits it, so we get a notification for something that is supposed to
stay quiet.
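For reference, the two alerts in the example above come from Prometheus rules
shaped roughly like this (rule names, expressions and extra labels are
illustrative placeholders, not our real rules; hour() works in UTC):
================================================
groups:
- name: example-rules
  rules:
  # Source alert: only active between 10:00 and 12:00, routed to the
  # blackhole receiver via the tag label, used purely for inhibition.
  - alert: SourceAlert
    expr: hour() >= 10 and hour() < 12
    labels:
      tag: source_tag
  # Target alert: the one that should stay quiet while the source is firing.
  - alert: TargetAlert
    expr: some_metric > 0   # placeholder expression
    labels:
      tag: target_tag
================================================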
Do you know if there is a way to prioritize an alert, or to wait for all the
gossip from the peers to finish before sending notifications? We tried the
flag --cluster.settle-timeout=2m
but it doesn't help.
Thanks a lot!
Regards,