On 20/11/2021 23:42, Tony Di Nucci wrote:
Yes, the diagram is a bit of a simplification but not hugely.

There may be multiple instances of AlertRouter; however, they will share a database.  Most likely things will be kept simple (at least initially), where each instance holds no state of its own.  Each active alert in the DB will be uniquely identified by the alert fingerprint (which the AlertManager API provides, i.e. a hash of the alert group's labels).  Each non-active alert will have a composite key (where one element is the alert group fingerprint).

In this architecture I see AlertManager having the responsibilities of capturing, grouping, inhibiting and silencing alerts.  The AlertRouter will have the responsibilities of: enriching alerts, routing based on business rules, monitoring/guaranteeing delivery and enabling analysis of alert history.

Due to my requirements, I think I need something like the AlertRouter.  The question is really: am I better off pushing from AlertManager to AlertRouter, or having AlertRouter pull from AlertManager?  My current opinion is that pulling comes with more benefits, but since I've not seen anyone else doing this I'm concerned there could be good reasons (that I'm not aware of) for not doing this.

If you really must have another system connected to Alertmanager, having it respond to webhook notifications would be the much simpler option. You'd still need to run multiple copies of your application behind a load balancer (and have a clustered database) for HA, but at least you'd not have the complexity of each instance having to discover all the Alertmanager instances, query them and then deduplicate amongst the different instances (again, something that Alertmanager already does itself).

I'm still struggling to see why you need an extra system at all - it feels very much like you'd be increasing complexity significantly, which naturally decreases reliability (more bits to break, have bugs or act in unexpected ways) and slows things down (as there is another "hop" for an alert to pass through). All of the things you mention can be done already through Alertmanager, or could be done pretty simply with a webhook receiver (without the need for any additional state storage, etc.):

* Adding data to an alert could be done with a simple webhook receiver that accepts an alert and then forwards it on to another API with extra information added (no need for any state).
* Routing can be done within Alertmanager, or for more complex cases could again be handled by a stateless webhook receiver.
* With regard to "guaranteeing" delivery, I don't see your suggestion allowing that (I believe it would actually make delivery less likely overall due to the added complexity and likelihood of bugs/unhandled cases). Alertmanager already does a good job of retrying on errors (and updating metrics if that happens), but not much can be done if the final system is totally down for long periods of time (and for many systems, if that happens, old alerts aren't very useful once it is back, as they may have already resolved).
* Alertmanager and Prometheus already expose a number of useful metrics (make sure your Prometheus is scraping itself & all the connected Alertmanagers), which should give you lots of useful information about alert history (with the advantage of that data being in the monitoring system you already know [along with whatever you have connected, like dashboards, alerts, etc.]).
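
To make the "stateless webhook receiver" idea a bit more concrete, here is a rough sketch of one that adds a label to each alert and forwards the notification on. It assumes the usual Alertmanager webhook JSON body (a top-level "alerts" list whose entries carry "labels"). The forward URL, the "service" and "team" labels, the lookup table and the function names are all made up for illustration; treat it as a starting point rather than a finished receiver.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Hypothetical values for illustration only; nothing here comes from this thread.
FORWARD_URL = "http://localhost:9000/alerts"
TEAM_BY_SERVICE = {"checkout": "payments", "search": "discovery"}


def enrich(payload):
    """Return a copy of a webhook payload with a 'team' label added to each alert."""
    enriched = dict(payload)
    enriched["alerts"] = []
    for alert in payload.get("alerts", []):
        labels = dict(alert.get("labels", {}))
        # Look up the owning team from the (assumed) 'service' label.
        labels["team"] = TEAM_BY_SERVICE.get(labels.get("service"), "unknown")
        enriched["alerts"].append({**alert, "labels": labels})
    return enriched


class Receiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(enrich(payload)).encode()
            req = request.Request(
                FORWARD_URL,
                data=body,
                headers={"Content-Type": "application/json"},
            )
            with request.urlopen(req, timeout=5):
                pass
            self.send_response(200)
        except Exception:
            # Report failure with a 5xx so the sender can apply its own retry handling.
            self.send_response(502)
        self.end_headers()


def serve(port=8080):
    """Run the receiver; no state is kept between requests."""
    HTTPServer(("", port), Receiver).serve_forever()
```

Calling serve() starts it on port 8080, and an Alertmanager webhook receiver would then be pointed at that address. Because nothing is stored between requests, any number of copies can sit behind a load balancer without a shared database.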

--
Stuart Clark
