I use Karma <https://github.com/prymitive/karma> as an alert dashboard.  It 
can combine multiple alertmanagers (or alertmanager clusters) into a single 
view, and push silences out to all of them.

alerta.io is another tool which I believe can do this, but I've not tried 
it.
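
For illustration only, here is a minimal sketch of what "pushing a silence 
out to all of them" involves if done by hand. It assumes Alertmanager's v2 
HTTP API (POST /api/v2/silences). The base URLs, the function names and the 
example matcher are hypothetical placeholders, and this is not how Karma 
itself is implemented.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Hypothetical placeholders: one Alertmanager base URL per regional cluster.
ALERTMANAGERS = [
    "http://alertmanager.region-a.example:9093",
    "http://alertmanager.region-b.example:9093",
]


def build_silence(matchers, duration, created_by, comment, now=None):
    """Build a request body for Alertmanager's POST /api/v2/silences."""
    start = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def http_post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def push_silence(silence, base_urls=ALERTMANAGERS, post=http_post_json):
    """Send the same silence to every cluster; record an ID or an error each."""
    results = {}
    for base in base_urls:
        try:
            results[base] = post(base + "/api/v2/silences", silence)["silenceID"]
        except Exception as exc:  # keep going if one region is unreachable
            results[base] = "error: %s" % exc
    return results
```

One URL per cluster should be enough, since members of the same cluster 
replicate silences among themselves.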

I'd then suggest keeping it nice and simple:
- one alertmanager cluster in each region
- prometheus in each region talks to its own alertmanagers only
- don't attempt any gossip or any other interconnection between regions
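
The notification delay described in the quoted message below is one reason 
to keep the clusters separate. A rough sketch of the arithmetic, taking the 
5s-per-position figure from that message as given (the real value depends on 
how Alertmanager is configured) and assuming positions start at 0 in a small 
regional cluster:

```python
def notification_wait_seconds(position, per_position_seconds=5):
    """Wait before a peer at `position` notifies, per the quoted message."""
    return position * per_position_seconds


# One global 20-member cluster: a peer at the position quoted in the message.
print(notification_wait_seconds(19))  # 95

# Separate 2-member regional clusters, assuming positions 0 and 1.
print(max(notification_wait_seconds(p) for p in range(2)))  # 5
```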

On Monday, 29 November 2021 at 17:55:15 UTC Dj Fox wrote:

> Hello!
> I'm wondering how to correctly use Alertmanager at scale.
>
> I have 10 regions. In each region, a pair of Prometheuses scrape exactly 
> the same set of applications (which are also local, located in that region).
> Then, each region has a pair of HA Alertmanagers, which are gossiping 
> together.
> Each Prometheus is connected to the 2 Alertmanagers of its region.
>
> In order to benefit from a global metrics view + object storage, we are 
> using Thanos.
> It works great.
>
> *But with that kind of architecture, how am I supposed to silence an 
> alert?*
> I want silences to be propagated to all Alertmanagers of the whole world. 
> But if they are separated in 10 clusters of 2 members, this doesn't happen 
> automatically.
>
> How am I supposed to use the silencing system at scale? I can't afford 
> to create a silence only in the region where the alert was firing, 
> because it then means I have no global view of all of my silences, and I 
> can forget where they are. It becomes hard to manage, and sometimes I may 
> want to mute globally across several regions.
>
> The memberlist library used by Alertmanager seems to have been designed 
> precisely to exchange information between many nodes of a large cluster 
> while maintaining good performance.
>
> So, I then tried to connect all 20 Alertmanagers to the same gossip 
> cluster. The goal is to make them automatically propagate their silences.
> While doing so, I made sure that each pair of Prometheus servers remains 
> connected ONLY to the 2 Alertmanagers of its own region.
>
> => It works well and it does what I want: 
> - Silences are propagated everywhere
> - Alerts are gossiped to all nodes, but the other regions never do 
> anything with an alert that they receive only by gossip and not from Prometheus.
> (If I understand correctly, an Alertmanager will never take responsibility 
> for notifying about an alert that it has not received from a Prometheus.)
>
> But then I noticed that in the Alertmanager implementation, there is a timer 
> that depends on the index position of each node in the memberlist cluster: an 
> Alertmanager receiving an alert from Prometheus will wait for 5s times its 
> index in the cluster.
> It means that if the Alertmanagers of one region have indexes 19 and 20, I'll 
> introduce a delay of 19x5 = 95s before the notification can be sent.
>
> The official README of the GitHub project clearly states:
> *Important: Do not load balance traffic between Prometheus and its 
> Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. 
> The Alertmanager implementation expects all alerts to be sent to all 
> Alertmanagers to ensure high availability.*
>
> Do you have advice on how to handle "Silencing at scale" with Alertmanager?
>
> Usually, we say that Prometheus does not handle scale (beyond one node), 
> because it focuses on doing its job correctly and very efficiently 
> (one Prometheus can ingest millions of samples and be very good at it).
> That's why tools have separate responsibilities, and Thanos/Cortex can 
> come to the rescue in that case.
> But in Thanos, I see no component designed to make Alertmanager 
> scalable.
>
> Connecting ALL 20 Prometheus servers to ALL 20 Alertmanagers seems a bit 
> overkill to me.
> I think it would make the cluster less robust, because it would be more 
> exposed to network partitions: the more links there are, the more likely 
> it is that a partition occurs somewhere, that alert deduplication fails, 
> and that I get notified twice for the same alert.
>
> Is it a good idea to connect all the Alertmanagers of the different regions 
> to the same memberlist cluster, while keeping only 2 Prometheus servers 
> connected to each Alertmanager?
>
> Thank you for your advice!
> Regards
>
