Part of the Prometheus/Alertmanager design is to better survive WAN split-brain.
IMO, running a wide Alertmanager cluster is a good idea when you have a wide network. The AM gossip protocol and deduplication are designed to fail open in the event of a split brain: each side of the partition keeps sending notifications, at the cost of possible duplicates.

The only thing you have to be aware of is that Prometheus-to-Alertmanager communication is all-to-all. Every Prometheus instance needs to send to every Alertmanager.

On Thu, Feb 27, 2025 at 5:38 PM 'Brian Candler' via Prometheus Users <prometheus-users@googlegroups.com> wrote:

> On Thursday, 27 February 2025 at 15:37:54 UTC hartfordfive wrote:
>
> With this approach, multiple AZs, which are typically each hosted within a
> single DC, still run the risk of being inaccessible should the link to the
> DC go down. So let's say you have datacenters in 3 regions (AMER, EMEA
> and APAC) and you've chosen to have a single AM cluster in EMEA. Should the
> link between AMER and EMEA and/or EMEA and APAC go down, then Prometheus
> instances located in AMER or APAC won't be able to send alert
> notifications. If you instead had 2 or 3 alertmanager instances in each of
> these regions, wouldn't that still allow alerts to be received and actioned
> within each of those regions?
>
>
> Only you know what the meaningful failure modes are for your environment.
> It seems to me that you expect key DC-to-DC connectivity to go down, but
> you are still able to send alerts (presumably via Internet or some other
> out-of-band means). You could get Prometheus to talk to alertmanager over
> the Internet too, using https, if you felt that was more reliable.
>
> Also, if DC-to-DC communication is unreliable, then personally I would not
> want to run any sort of distributed application across it (alertmanager or
> otherwise), due to problems with partitioning / split brain.
>
> However, you need to make your own call as to what works best for you, and
> what is the optimum tradeoff between cost, complexity, and reliability.
> My gut feeling is towards simplicity and reliability, which for me means
> either a single global alertmanager cluster, or a separate AM cluster per
> region, but you can build whatever you're comfortable with.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/prometheus-users/ec7b1e1f-d1af-4e0c-ad59-1f238e661737n%40googlegroups.com
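
P.S. To make the all-to-all point at the top of this reply concrete, here is a rough sketch of what the alerting section of prometheus.yml could look like in the three-region example. The hostnames are made up for illustration, and I'm assuming one Alertmanager per region listening on the default port 9093. Adapt it to however you actually discover your Alertmanagers. The same list would go on every Prometheus instance in every region.

```yaml
# prometheus.yml (fragment), identical on every Prometheus instance.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical hostnames, one Alertmanager per region.
            - alertmanager-amer.example.com:9093
            - alertmanager-emea.example.com:9093
            - alertmanager-apac.example.com:9093
```

List the individual Alertmanager instances rather than putting them behind a load balancer, so that each Prometheus really does send every alert to every Alertmanager it can still reach.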