On Thursday, 27 February 2025 at 15:37:54 UTC hartfordfive wrote:

With this approach, multiple AZs, which are typically each hosted within a 
single DC, still run the risk of being inaccessible should the link to the 
DC go down.  So let's say you have datacenters in 3 regions (AMER, EMEA 
and APAC) and you've chosen to have a single AM cluster in EMEA.  Should the 
link between AMER and EMEA and/or EMEA and APAC go down, then Prometheus 
instances located in AMER or APAC won't be able to send alert 
notifications.  If you instead had 2 or 3 alertmanager instances in each of 
these regions, wouldn't that still allow alerts to be received and actioned 
within each of those regions?


Only you know what the meaningful failure modes are for your environment. 
It seems to me that you expect key DC-to-DC connectivity to go down, but 
that you would still be able to send alert notifications (presumably via 
the Internet or some other out-of-band means).  You could get Prometheus 
to talk to alertmanager over 
the Internet too, using https, if you felt that was more reliable.
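
As a rough sketch of what that could look like in prometheus.yml (the 
hostnames, port, username and password file path below are made up for 
illustration, and this assumes the alertmanagers are reachable over TLS, 
either natively or behind a reverse proxy):

```yaml
# prometheus.yml (fragment) -- hypothetical hosts and credentials
alerting:
  alertmanagers:
    - scheme: https                  # talk to alertmanager over TLS
      basic_auth:
        username: prometheus         # assumed account name
        password_file: /etc/prometheus/am-password   # assumed path
      static_configs:
        - targets:
            - am1.emea.example.com:9093
            - am2.emea.example.com:9093
```

You would want some form of authentication in front of alertmanager if 
you expose it this way.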

Also, if DC-to-DC communication is unreliable, then personally I would not 
want to run any sort of distributed application across it (alertmanager or 
otherwise), due to problems with partitioning / split brain.

However, you need to make your own call as to what works best for you, and 
what is the optimum tradeoff between cost, complexity, and reliability.  My 
gut feeling is towards simplicity and reliability, which for me means 
either a single global alertmanager cluster, or a separate AM cluster per 
region, but you can build whatever you're comfortable with.
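
For the per-region option, a minimal sketch of the Prometheus side in one 
region might be as follows (hostnames are hypothetical; each region's 
Prometheus servers would list only their own region's alertmanagers):

```yaml
# prometheus.yml (fragment) for Prometheus servers in AMER
# Hostnames are placeholders, not real endpoints.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # List every member of the local cluster: Prometheus sends
            # each alert to all of them, and the cluster deduplicates
            # the resulting notifications.
            - am1.amer.example.com:9093
            - am2.amer.example.com:9093
            - am3.amer.example.com:9093
```

The EMEA and APAC servers would carry the same block with their own 
local targets, so no alert has to cross a DC-to-DC link to be delivered.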
