On Thursday, 27 February 2025 at 15:37:54 UTC hartfordfive wrote: With this approach, multiple AZs, which are typically each hosted within a single DC, still run the risk of being inaccessible should the link to the DC go down. So let's say you have datacenters in 3 regions (AMER, EMEA and APAC) and you've chosen to have a single AM cluster in EMEA. Should the link between AMER and EMEA and/or EMEA and APAC go down, then Prometheus instances located in AMER or APAC won't be able to send alert notifications. If you instead had 2 or 3 alertmanager instances in each of these regions, wouldn't that still allow alerts to be received and actioned within each of those regions?
Only you know what the meaningful failure modes are for your environment. It seems to me that you expect key DC-to-DC connectivity to go down, yet still expect to be able to send alerts (presumably via the Internet or some other out-of-band means). You could get Prometheus to talk to alertmanager over the Internet too, using https, if you felt that was more reliable.

Also, if DC-to-DC communication is unreliable, then personally I would not want to run any sort of distributed application across it (alertmanager or otherwise), due to problems with partitioning / split brain.

However, you need to make your own call as to what works best for you, and what the best tradeoff is between cost, complexity, and reliability. My gut feeling is towards simplicity and reliability, which for me means either a single global alertmanager cluster, or a separate AM cluster per region, but you can build whatever you're comfortable with.
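
As a rough sketch of the "separate AM cluster per region" option, this is roughly what the alerting section of prometheus.yml could look like on a Prometheus server in AMER. The hostnames and the timeout value are made up for illustration, and the https scheme assumes the alertmanagers sit behind TLS; adjust to whatever your environment actually uses. Prometheus sends each alert to every alertmanager listed, and the cluster members deduplicate notifications among themselves, so list all instances of the regional cluster rather than a load balancer in front of them.

```yaml
# prometheus.yml (fragment) on a Prometheus server in AMER.
# All hostnames below are hypothetical placeholders.
alerting:
  alertmanagers:
    - scheme: https          # assumes alertmanager is served over TLS
      timeout: 10s           # illustrative value, not a recommendation
      static_configs:
        - targets:
            # every member of the AMER alertmanager cluster
            - am1.amer.example.com:9093
            - am2.amer.example.com:9093
            - am3.amer.example.com:9093
```

Prometheus servers in EMEA and APAC would carry the same block pointing at their own regional cluster. For the single global cluster option, every Prometheus server would instead list the same set of targets.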