[prometheus-users] Alertmanager: flapping on resolve when running multiple prometheus replicas

'T.G.' via Prometheus Users Wed, 01 Sep 2021 02:25:25 -0700

Hi

For redundancy, we are running two replicas of Prometheus, which both send 
alerts to an HA Alertmanager cluster.


However, we noticed flapping whenever an alert resolves. This is what it 
looks like from the point of view of a receiver:

* Alert is resolved
* Alert is reopened immediately
* Alert is resolved again immediately

Usually the whole sequence lasts less than 30 seconds.

We are pretty sure this is due to the alert being evaluated as resolved by 
one of the replicas but not by the other.
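
To illustrate the suspected cause, here is a minimal sketch of the timing 
only. It assumes each replica evaluates rules on its own fixed schedule with 
the same interval but a different start offset; the offsets and the clear 
time are made-up values, and nothing here models Alertmanager itself.

```python
import math


def next_eval(t, interval, offset):
    """First evaluation time >= t for a replica that evaluates at
    offset, offset + interval, offset + 2 * interval, ..."""
    k = math.ceil((t - offset) / interval)
    return offset + max(k, 0) * interval


def disagreement_window(clear_time, interval, offset_a, offset_b):
    """Seconds during which one replica already sees the alert as
    resolved while the other still sees it as firing."""
    seen_a = next_eval(clear_time, interval, offset_a)
    seen_b = next_eval(clear_time, interval, offset_b)
    return abs(seen_a - seen_b)


# Hypothetical values: 1m evaluation interval, replicas started 40s apart,
# underlying condition clears at t = 130s.
print(disagreement_window(clear_time=130, interval=60, offset_a=0, offset_b=40))
```

Under these assumptions the two replicas disagree for either the offset 
difference or the interval minus that difference, depending on when the 
condition clears, so the window is always shorter than one evaluation 
interval.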

We tried increasing the group interval, but it seems that a resolved status 
is forwarded to receivers immediately, regardless of that setting.

Is there a setting in Alertmanager we can use to prevent that? Here are the 
settings we have that I think might be relevant:

- Prometheus scrape interval: 15s
- Prometheus evaluation interval: 1m
- Alertmanager group_wait: 20s
- Alertmanager group_interval: 1m

Best regards,
T

