[prometheus-users] Re: Spreading single alertmanager cluster nodes over multiple geographical regions

hartfordfive Thu, 27 Feb 2025 07:38:03 -0800

Thank you for the quick reply Brian, much appreciated!

> If the region has completely failed, then presumably there's nothing within
> that region that is worth alerting on anyway. You can monitor the
> alertmanager cluster in one region from a prometheus in another region, to
> get an alert of that failure mode.

That's assuming that everything within that region is actually down. Some
companies have private network links (usually redundant) that they use for
communication between regions. If both redundant links to and from a given
region go down, all systems and services within the region itself may
still be fully functional. They just won't be visible to users located in
another region (think of an office in America and an office in Europe).
In this case, we still want to continue monitoring all services in each
region. Since receivers such as VictorOps are reached at an
internet-facing address, notifications should in theory keep working even
when a company's internal private cross-region network link goes down.

> However, the simplest solution would be to have a single alertmanager
> cluster, spread across AZs in a single region; all the other prometheuses
> send their alerts to this cluster. Alerting is low traffic and I don't see
> a particular reason to have a separate alertmanager cluster in every
> region. You can test that you can reach prometheus in every other region,
> and then you have high confidence that prometheus in those regions will be
> able to contact the central alertmanager.

In this context, I'm mostly referring to a setup where a company manages
all of its infrastructure across multiple private datacenters. With this
approach, multiple AZs, which are typically all hosted within a single DC,
still run the risk of being inaccessible should the link to the DC go
down. So let's say you have datacenters in 3 regions (AMER, EMEA and APAC)
and you've chosen to have a single AM cluster in EMEA. Should the link
between AMER and EMEA and/or EMEA and APAC go down, Prometheus instances
located in AMER or APAC won't be able to send their alerts to it. If you
instead had 2 or 3 alertmanager instances in each of these regions,
wouldn't that still allow alerts to be received and actioned within each
of those regions?
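
To make the failure scenario above concrete, here is a rough Python sketch
of the reasoning. Only the region names come from this message; the helper
functions are hypothetical, and the model assumes each region pair has one
direct link with no re-routing through a third region. It only checks
whether a Prometheus in a region can reach some Alertmanager, not how
gossip or silences behave across a partition.

```python
# Illustrative model only: region names are from the message above,
# the helper functions and the direct-link assumption are hypothetical.
REGIONS = {"AMER", "EMEA", "APAC"}


def reachable(src, dst, down_links):
    """A region always reaches itself; otherwise the direct link must be up."""
    return src == dst or frozenset((src, dst)) not in down_links


def regions_able_to_alert(am_regions, down_links):
    """Regions whose Prometheus can reach at least one Alertmanager."""
    return {
        region
        for region in REGIONS
        if any(reachable(region, am, down_links) for am in am_regions)
    }


# Both links into EMEA are down; the AMER <-> APAC link is still up.
down = {frozenset(("AMER", "EMEA")), frozenset(("EMEA", "APAC"))}

# Single AM cluster hosted only in EMEA.
central = regions_able_to_alert({"EMEA"}, down)

# 2 or 3 AM instances hosted in every region.
spread = regions_able_to_alert(REGIONS, down)

print(sorted(central))  # ['EMEA']
print(sorted(spread))   # ['AMER', 'APAC', 'EMEA']
```

Under these assumptions, only EMEA can still deliver alerts with the
central cluster, while every region can with instances spread out.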

I realize the other option could be to have 3 separate AM clusters, one in
each region, and have all Prometheus instances in each region send to all
6 to 9 existing Alertmanager servers (2 or 3 in each of the AMER, EMEA and
APAC regions). I realize we could centralize silence management with
something like Karma <https://github.com/prymitive/karma> or alerta.io and
have a single-pane-of-glass view of the global list of alerts. I'm just
trying to see if we can eliminate a 3rd party application for this and
stick with a single globally-distributed alertmanager cluster.

--
To view this discussion visit
https://groups.google.com/d/msgid/prometheus-users/9b45c618-afa3-4d3c-a2aa-6fe5862471bbn%40googlegroups.com.
