I use Karma <https://github.com/prymitive/karma> as an alert dashboard. It can combine multiple alertmanagers (or alertmanager clusters) into a single view, and push silences out to all of them.
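
To illustrate what "pushing a silence out to all of them" amounts to, here is a minimal sketch that posts the same silence to several independent Alertmanager clusters through the Alertmanager v2 HTTP API (`POST /api/v2/silences`). It is not Karma's code. The function names, base URLs, label values and author are hypothetical, and it creates exact-match matchers only.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence payload for the Alertmanager v2 API (exact-match matchers only)."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def push_silence(base_urls, silence):
    """POST the same silence to one Alertmanager per cluster; return the IDs per URL."""
    ids = {}
    body = json.dumps(silence).encode("utf-8")
    for base in base_urls:
        req = urllib.request.Request(
            base.rstrip("/") + "/api/v2/silences",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            ids[base] = json.load(resp)["silenceID"]
    return ids


# Hypothetical usage: one URL per regional cluster (gossip spreads it within the cluster).
# push_silence(["http://am.region1.example:9093", "http://am.region2.example:9093"],
#              build_silence({"alertname": "DiskFull"}, 2, "ops", "planned maintenance"))
```

Each cluster returns its own silence ID, so the caller has to keep the mapping in order to expire the silence everywhere later.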
alerta.io is another tool which I believe can do this, but I've not tried it.

I'd then suggest keeping it nice and simple:
- one alertmanager cluster in each region
- prometheus in each region talks to its own alertmanagers only
- don't attempt any gossip or any other interconnection between regions

On Monday, 29 November 2021 at 17:55:15 UTC Dj Fox wrote:

> Hello!
> I'm wondering how to correctly use Alertmanager at scale.
>
> I have 10 regions. In each region, a pair of Prometheuses scrape exactly
> the same set of applications (which are also local, located in that region).
> Then, each region has a pair of HA Alertmanagers, which are gossiping
> together.
> Each Prometheus is connected to the 2 Alertmanagers of its region.
>
> In order to benefit from a global metrics view + object storage, we are
> using Thanos.
> It works great.
>
> *But with that kind of architecture, how am I supposed to silence an
> alert?*
> I want silences to be propagated to all Alertmanagers of the whole world.
> But if they are separated in 10 clusters of 2 members, this doesn't happen
> automatically.
>
> How am I supposed to use the silencing system at scale? I can't afford
> creating only one silence in the correct region where the alert was firing,
> because it then means I have no global view of all of my silences, and I
> can forget where they are. It becomes hard to manage, and sometimes I may
> want to mute globally on several regions.
>
> The memberlist library used by Alertmanager seems to have been exactly
> designed to exchange information between a lot of nodes of a big cluster,
> while keeping good performance at the same time.
>
> So, I then tried to connect all 20 Alertmanagers to the same Gossip
> cluster. The goal is to make them automatically propagate their silences.
> By doing so, I made sure that one pair of Prometheus continues to ONLY be
> connected to 2 Alertmanagers of the same region.
>
> => It works well and it does what I want:
> - Silences are propagated everywhere
> - Alerts are gossiped to all nodes, but the other regions never do
> anything with the alert that they receive only by Gossip and not by Prom.
> (If I understand correctly, an Alertmanager will never take responsibility
> to notify for an alert if it has not received it by a Prom.)
>
> But then I noticed that in the Alertmanager implementation, there is a timer
> depending on the index position of each node in the memberlist cluster: an
> Alertmanager receiving an alert from Prom will wait for 5s times its index
> in the cluster.
> It means that if one Alertmanager region has index 19 and 20, I'll
> introduce a delay of 19x5 = 95s before the notification can be sent.
>
> In the official README in the GitHub project, it's clearly stated:
> *Important: Do not load balance traffic between Prometheus and its
> Alertmanagers, but instead point Prometheus to a list of all Alertmanagers.
> The Alertmanager implementation expects all alerts to be sent to all
> Alertmanagers to ensure high availability.*
>
> Do you have advice on how to handle "Silencing at scale" with Alertmanager?
>
> Usually, we say that Prometheus does not handle scale (beyond one node),
> because it focuses on doing its job correctly, in a very efficient manner
> (one Prometheus can ingest millions of samples and be very good at it).
> That's why tools have separated responsibilities, and Thanos/Cortex can
> come to the rescue in that case.
> But in Thanos, I see no component designed to make Alertmanager
> scalable.
>
> Connecting ALL 20 Prometheus to ALL 20 Alertmanagers seems a bit overkill
> to me.
> I think it would make the cluster less robust, because I would expose
> myself more and be more susceptible to network partitions, causing a higher
> probability of failing alert deduplication (higher probability of being
> notified twice for the same alert because of a higher probability that a
> network partition will occur somewhere).
>
> Is it a good idea to connect all Alertmanagers of different regions to the
> same memberlist cluster, while at the same time keeping only 2 Prom
> connected to each Alertmanager?
>
> Thank you for your advice!
> Regards
>
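
The delay arithmetic in the quoted message can be sketched as follows. This is only an illustration of the behaviour as the poster describes it (a fixed wait of 5 s multiplied by the node's index in the gossip cluster), not Alertmanager's actual code. The function name and parameters are hypothetical, and the real per-position wait depends on the Alertmanager version and settings.

```python
def notification_delay(position: int, wait_per_position: float = 5.0) -> float:
    """Seconds a node at `position` in the cluster waits before notifying.

    Assumption taken from the quoted message: delay = index * 5 s.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    return position * wait_per_position


# With 20 nodes in one gossip cluster, a region whose nodes sit late in the
# ordering waits far longer than a region at the front.
for pos in (0, 1, 19):
    print(pos, notification_delay(pos))
```

With one two-node cluster per region, as suggested in the reply, the positions are only 0 and 1, so under the same assumption the worst-case wait is a single interval.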

