Hi For redundancy, we are running two replicas of Prometheus which then alerts to a HA Alertmanager cluster.
However, we noticed that we experienced flapping when an alert would resolve. This is what it looks like from the point of vue of a receiver: * Alert is resolved * Alert is reopened immediatly * Alert is resolved immediatly Usually this will last < 30 seconds We are pretty sure this is due to the alert being evaluated as resolved by one of the replicas but not by the other. We tried to increase the group interval but it seems that a solved status is forwarded immediately to receivers, independently of that Is there a setting in Alertmanager we can use to prevent that? Here are the settings we have I think might be relevant: - Prometheus scrape interval: 15s - Prometheus evaluation interval: 1m - Alertmanager group_wait: 20s - Alertmanager group_interval: 1m Best regards, T -- You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/496e55f5-6051-466d-aa3c-fb68410c9347n%40googlegroups.com.

