Hi! 

I think this is mostly a configuration issue, so I'm posting it here 
before opening a GitHub issue, in case someone can help me. 

I have a Prometheus server running in Kubernetes with two Alertmanagers in 
HA (1 Prometheus server and 2 Alertmanagers).
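
For context, Prometheus points at both Alertmanagers, roughly like this (the target names below are placeholders, not our real service names):

# prometheus.yml, alerting section (sketch)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - prometheus-alertmanager-0:9093
            - prometheus-alertmanager-1:9093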

Alertmanager Configuration: 

================================================
# Deployment relevant bits
  prometheus-alertmanager:
    Image:         prom/alertmanager:v0.19.0
    Port:          9093/TCP
    Host Port:     0/TCP
    Args:
      --config.file=/etc/config/alertmanager.yml
      --storage.path=/data
      --log.level=debug
      --cluster.settle-timeout=2m
      --cluster.listen-address=0.0.0.0:19604



================================================
# Configmap relevant bits
receivers:
   (...)
route:
  group_wait: 120s
  group_interval: 5m
  receiver: default-receiver
  repeat_interval: 168h
  group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname', 'objectid', 'alertid', 'resourceid']
  routes:
    - match:
        severity: blackhole
      receiver: blackhole
      continue: false
    - match:
        tag: "source_tag"
      receiver: blackhole
      repeat_interval: 1m
      group_interval: 1m
      continue: false
    (...)
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
- source_match:
    tag: "source_tag"
  target_match:
    tag: "target_tag"


The inhibition rules work like a charm until one of the Alertmanagers dies. If 
a node in the cluster dies, one of the Alertmanager pods has to be 
relocated and restarts. When it restarts, we can see in the log that the 
alert carrying "tag: 'target_tag'" is received before the source tag one, 
and a notification is fired for it. 


*Example:*

We have an alert in Prometheus that fires between 10 and 12 AM. While this 
alert is firing, I want all the alerts that match a given label (in this case 
tag: target_tag) to be inhibited. This works flawlessly unless the 
Alertmanager is restarted, in which case I can see in the logs:
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:217 msg="Starting 
Alertmanager" version="(version=0.19.0, branch=HEAD, 
revision=7aa5d19fea3f58e3d27dbdeb0f2883037168914a)"
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:218 
build_context="(go=go1.12.8, user=root@587d0268f963, 
date=20190903-15:01:40)"
level=debug ts=2020-03-31T14:22:58.506Z caller=cluster.go:149 
component=cluster msg="resolved peers to following addresses" peers=<peers>
(...)
level=debug ts=2020-03-31T14:22:58.702Z caller=cluster.go:306 
component=cluster memberlist="2020/03/31 14:22:58 [DEBUG] memberlist: 
Initiating push/pull sync with: <peer IP>\n"
level=debug ts=2020-03-31T14:22:58.704Z caller=delegate.go:230 
component=cluster received=NotifyJoin (...) addr=<peer IP>"
level=debug ts=2020-03-31T14:22:58.802Z caller=cluster.go:470 
component=cluster msg="peer rejoined" (...)"

level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This 
is the alert i want to inhibit", tag="target_tag" "}\" 
receiver:<group_name:\"default-receiver\" (...)> 
timestamp:<seconds:1585648804 nanos:750301 > 
firing_alerts:3876410699172976497 > expires_at:<seconds:1586080804 
nanos:750301 > "
level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This 
is the alert that fires between 10 and 12AM", tag="source_tag" "}\" 
receiver:<group_name:\"blackhole\" (...)> "

The alert that is supposed to be inhibited comes in (from the peer) before 
the source alert that inhibits it, so we get a notification for something 
that is supposed to stay quiet.
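
To make the example concrete, the rule pair looks roughly like this on the Prometheus side (alert names and expressions are simplified placeholders, not our real rules):

# Prometheus alerting rules (sketch)
groups:
  - name: example
    rules:
      - alert: MaintenanceWindow             # the "source" alert that fires between 10 and 12 AM
        expr: hour() >= 10 and hour() < 12   # placeholder expression (hour() is UTC)
        labels:
          tag: source_tag
      - alert: SomethingNoisy                # any alert we want quiet while the above is firing
        expr: vector(1)                      # placeholder expression
        labels:
          tag: target_tag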



Do you know if there is a way to prioritize an alert, or to wait for all the 
gossip from the peers to finish before sending notifications? We tried the 
flag --cluster.settle-timeout=2m, but it doesn't help.
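
For reference, these are the cluster-related flags involved, as far as I understand them (the peer address is a placeholder and the peer-timeout line is just the default, we have not changed it):

      --cluster.listen-address=0.0.0.0:19604
      --cluster.peer=<other-alertmanager>:19604   # placeholder for the other Alertmanager
      --cluster.settle-timeout=2m                 # wait for gossip to settle before evaluating notifications
      --cluster.peer-timeout=15s                  # default; time to wait between peers to send notifications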


Thanks a lot!

Regards,

