On 20/11/2021 23:42, Tony Di Nucci wrote:
> Yes, the diagram is a bit of a simplification but not hugely.
>
> There may be multiple instances of AlertRouter however they will share
> a database. Most likely things will be kept simple (at least
> initially) where each instance holds no state of its own. Each active
> alert in the DB will be uniquely identified by the alert fingerprint
> (which the AlertManager API provides, i.e. a hash of the alert group's
> labels). Each non-active alert will have a composite key (where one
> element is the alert group fingerprint).
>
> In this architecture I see AlertManager having the responsibilities of
> capturing, grouping, inhibiting and silencing alerts. The AlertRouter
> will have the responsibilities of: enriching alerts, routing based on
> business rules, monitoring/guaranteeing delivery and enabling analysis
> of alert history.
>
> Due to my requirements, I think I need something like the
> AlertRouter. The question is really, am I better to push from
> AlertManager to AlertRouter, or to have AlertRouter pull from
> AlertManager. My current opinion is that pulling comes with more
> benefits but since I've not seen anyone else doing this I'm concerned
> there could be good reasons (I'm not aware of) for not doing this.

If you really must have another system connected to Alertmanager, having
it respond to webhook notifications would be the much simpler option.
You'd still need to run multiple copies of your application behind a load
balancer (and have a clustered database) for HA, but at least you'd not
have the complexity of each instance having to discover all the
Alertmanager instances, query them and then deduplicate amongst the
different instances (again something that Alertmanager does itself already).
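To give a feel for the extra work the pull approach implies, here is a
minimal sketch of just the deduplication step. It assumes each
Alertmanager instance has already been discovered and queried, and that
every returned alert carries a "fingerprint" field (as in the v2 API's
alert listing). The function name, the instance names and the shape of
the merged result are hypothetical, purely for illustration.

```python
from typing import Dict, List


def merge_by_fingerprint(per_instance: Dict[str, List[dict]]) -> Dict[str, dict]:
    """Collapse the alert lists of several Alertmanager instances into one view.

    per_instance maps a (hypothetical) instance name to the list of alerts
    that instance returned. Alerts are matched on their "fingerprint" field.
    """
    merged: Dict[str, dict] = {}
    for instance, alerts in per_instance.items():
        for alert in alerts:
            # Keep the first copy seen and remember which instances reported it.
            entry = merged.setdefault(
                alert["fingerprint"], {"alert": alert, "seen_on": []}
            )
            entry["seen_on"].append(instance)
    return merged
```

Discovery, querying, timeouts and reconciling instances that disagree
would all sit on top of this, which is where most of the complexity
comes from.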
I'm still struggling to see why you need an extra system at all - it
feels very much like you'd be increasing complexity significantly which
naturally decreases reliability (more bits to break, have bugs or act in
unexpected ways) and slows things down (as there is another "hop" for an
alert to pass through). All of the things you mention can be done
already through Alertmanager, or could be done pretty simply with a
webhook receiver (without the need for any additional state storage, etc.):
* Adding data to an alert could be done with a simple webhook receiver,
that accepts an alert and then forwards it on to another API with extra
information added (no need for any state)
* Routing can be done within Alertmanager, or for more complex cases
could again be handled by a stateless webhook receiver
* With regards to "guaranteeing" delivery, I don't see your suggestion
allowing that (I believe it would actually make that less likely overall
due to the added complexity and likelihood of bugs/unhandled cases).
Alertmanager already does a good job of retrying on errors (and updating
metrics if that happens) but not much can be done if the final system is
totally down for long periods of time (and for many systems if that
happens old alerts aren't very useful once it is back, as they may have
already resolved).
* Alertmanager and Prometheus already expose a number of useful metrics
(make sure your Prometheus is scraping itself & all the connected
Alertmanagers) which should give you lots of useful information about
alert history (with the advantage of that data being with the monitoring
system you already know [with whatever you have connected like
dashboards, alerts, etc.])
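As a rough sketch of the stateless webhook receiver idea from the first
two points above: the code below accepts an Alertmanager webhook
notification (a JSON body with an "alerts" list, each alert having
"labels" and "annotations"), adds an annotation, picks a destination and
forwards the alert. The "team" label, the "owner" annotation, the URLs,
the port and all function names are assumptions made up for the example,
not anything Alertmanager defines.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical lookup tables; a real receiver would load these from config.
OWNERS = {"payments": "payments-oncall"}
DOWNSTREAM_URLS = {"payments": "http://payments-pager.example/alerts"}
DEFAULT_URL = "http://default-pager.example/alerts"


def enrich(alert: dict) -> dict:
    """Return a copy of one alert with an extra 'owner' annotation."""
    team = alert.get("labels", {}).get("team", "")
    annotations = dict(alert.get("annotations", {}))
    annotations["owner"] = OWNERS.get(team, "unassigned")
    return {**alert, "annotations": annotations}


def route(alert: dict) -> str:
    """Pick a downstream URL from the alert's (assumed) 'team' label."""
    team = alert.get("labels", {}).get("team", "")
    return DOWNSTREAM_URLS.get(team, DEFAULT_URL)


def forward(url: str, alert: dict) -> None:
    """POST one alert as JSON; raises OSError (incl. HTTPError) on failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5):
        pass


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # malformed body, retrying will not help
            self.end_headers()
            return
        try:
            for alert in payload.get("alerts", []):
                enriched = enrich(alert)
                forward(route(enriched), enriched)
        except OSError:
            # Signal failure so the sender can retry the notification.
            self.send_response(502)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()


def serve(port: int = 9099) -> None:
    """Run the receiver; no state is kept between requests."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling serve() starts the receiver. Returning an error status when
forwarding fails leaves retrying to Alertmanager rather than to a
separate store, so the downstream system should tolerate the occasional
duplicate if a notification is retried after a partial failure.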
--
Stuart Clark