Thank you for the quick reply, Brian, much appreciated!

 

> If the region has completely failed, then presumably there's nothing within
> that region that is worth alerting on anyway. You can monitor the
> alertmanager cluster in one region from a prometheus in another region, to
> get an alert of that failure mode.


That's assuming that everything within that region is actually down. In
some cases, companies have private network links (usually redundant) that
they use for communication between regions. If both redundant links to and
from a given region go down, all systems and services within the region
itself may still be fully functional. They just won't be visible to users
located in another region (think of an office location in America and an
office location in Europe). In this case, we still want to continue
monitoring all services in each region. Since notifications to receivers
such as VictorOps are sent to an internet-facing address, in theory they
should continue to work even when a company's internal private
cross-region network link goes down.
 

> However, the simplest solution would be to have a single alertmanager
> cluster, spread across AZs in a single region; all the other prometheuses
> send their alerts to this cluster. Alerting is low traffic and I don't see
> a particular reason to have a separate alertmanager cluster in every
> region.  You can test that you can reach prometheus in every other region,
> and then you have high confidence that prometheus in those regions will be
> able to contact the central alertmanager.


In this context, I'm mostly referring to a setup where a company manages
all of its infrastructure across multiple private datacenters. With this
approach, multiple AZs, which are typically all hosted within a single DC,
still run the risk of being inaccessible should the link to the DC go
down. So let's say you have datacenters in 3 regions (AMER, EMEA and APAC)
and you've chosen to have a single AM cluster in EMEA. Should the link
between AMER and EMEA and/or between EMEA and APAC go down, the Prometheus
instances located in AMER or APAC won't be able to send alert
notifications. If you instead had 2 or 3 Alertmanager instances in each of
these regions, wouldn't that still allow alerts to be received and
actioned within each of those regions?

I realize the other option could be to have a separate AM cluster in each
region (3 in total) and have all Prometheus instances in every region send
to all 6 to 9 Alertmanager servers (2 or 3 in each of the AMER, EMEA and
APAC regions). I also realize we could centralize silence management with
something like Karma <https://github.com/prymitive/karma> or alerta.io and
get a single pane of glass view of the global list of alerts. However, I'm
trying to see whether we can avoid a 3rd party application for this and
stick with a single globally-distributed Alertmanager cluster instead?
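
To make the "every Prometheus sends to every Alertmanager" option concrete,
here is a minimal sketch of the alerting section of prometheus.yml as I
picture it. The hostnames are made up purely for illustration, and I've
assumed 2 instances per region on Alertmanager's default port 9093:

```yaml
# prometheus.yml (fragment), identical on every Prometheus in every region.
# All hostnames below are hypothetical placeholders.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # AMER
            - am1.amer.example.internal:9093
            - am2.amer.example.internal:9093
            # EMEA
            - am1.emea.example.internal:9093
            - am2.emea.example.internal:9093
            # APAC
            - am1.apac.example.internal:9093
            - am2.apac.example.internal:9093
```

With this, Prometheus sends each alert to every listed target, so if a
cross-region link goes down the Alertmanagers in the local region would
still receive alerts from the local Prometheus instances.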


