Thank you for the quick reply Brian, much appreciated!

> If the region has completely failed, then presumably there's nothing within that region that is worth alerting on anyway. You can monitor the alertmanager cluster in one region from a prometheus in another region, to get an alert of that failure mode.

That's assuming that everything within that region is actually down. Some companies use private network links (usually redundant) for communication between regions. If both redundant links to and from a given region go down, all systems and services within the region itself may still be fully functional. They just won't be visible to users located in another region (think of an office in America and an office in Europe). In this case, we still want to continue monitoring all services in each region. Since receivers such as VictorOps accept their payloads at an internet-facing address, notifications should in theory keep working even when a company's internal private cross-region link goes down.

> However, the simplest solution would be to have a single alertmanager cluster, spread across AZs in a single region; all the other prometheuses send their alerts to this cluster. Alerting is low traffic and I don't see a particular reason to have a separate alertmanager cluster in every region. You can test that you can reach prometheus in every other region, and then you have high confidence that prometheus in those regions will be able to contact the central alertmanager.

In this context, I'm mostly referring to a setup where a company manages all of its infrastructure across multiple private datacenters. With this approach, multiple AZs, which are typically hosted within a single DC, still run the risk of being inaccessible should the link to the DC go down.

So let's say you have datacenters in 3 regions (AMER, EMEA and APAC) and you've chosen to have a single AM cluster in EMEA. Should the link between AMER and EMEA and/or between EMEA and APAC go down, Prometheus instances located in AMER or APAC won't be able to send alert notifications. If you instead had 2 or 3 Alertmanager instances in each of these regions, wouldn't that still allow alerts to be received and actioned within each of those regions?

I realize the other option could be to have 3 separate AM clusters, one per region, and have all Prometheus instances in each region send to all 6 to 9 Alertmanager servers (2 or 3 in each of the AMER, EMEA and APAC regions). I also realize we could centralize silence management with something like Karma <https://github.com/prymitive/karma> or alerta.io and get a single-pane-of-glass view of the global list of alerts. That said, I'm just trying to see if we can eliminate a 3rd party application for this and stick with a single globally distributed Alertmanager cluster.
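
To make the scenario concrete, here is a minimal Python sketch of the reachability argument. It is only a toy model under stated assumptions: the region names come from the example above, each pair of regions is connected by one direct private link with no re-routing through a third region, and the helper names (`reachable`, `can_alert`) are hypothetical. It checks whether Prometheus in each region can still reach at least one Alertmanager when both links into EMEA are down.

```python
# Toy model (assumption): three regions, one direct private link per pair,
# and no re-routing of traffic through a third region.
REGIONS = ("AMER", "EMEA", "APAC")


def reachable(src, dst, down_links):
    """A region always reaches itself; otherwise the direct link must be up."""
    return src == dst or frozenset((src, dst)) not in down_links


def can_alert(am_regions, down_links):
    """Map each region to whether its Prometheus reaches at least one Alertmanager."""
    return {
        region: any(reachable(region, am, down_links) for am in am_regions)
        for region in REGIONS
    }


# Both private links into EMEA are down; the AMER <-> APAC link is still up.
down = {frozenset(("AMER", "EMEA")), frozenset(("EMEA", "APAC"))}

# Option 1: a single Alertmanager cluster hosted in EMEA only.
central = can_alert(["EMEA"], down)

# Option 2: Alertmanager instances in every region.
per_region = can_alert(["AMER", "EMEA", "APAC"], down)

print("central AM in EMEA: ", central)
print("AMs in every region:", per_region)
```

In this model only EMEA can still deliver alerts with the central cluster, while every region can with Alertmanagers in each region. The sketch says nothing about how silences or notification deduplication would behave across a partitioned cluster, which is really the part I'm asking about.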