Re: [prometheus-developers] Is this alerting architecture crazy?

Stuart Clark Sun, 21 Nov 2021 14:28:54 -0800

On 20/11/2021 23:42, Tony Di Nucci wrote:

Yes, the diagram is a bit of a simplification but not hugely.
There may be multiple instances of AlertRouter however they will share a database. Most likely things will be kept simple (at least initially) where each instance holds no state of its own. Each active alert in the DB will be uniquely identified by the alert fingerprint (which the AlertManager API provides, i.e. a hash of the alert group's labels). Each non-active alert will have a composite key (where one element is the alert group fingerprint).
In this architecture I see AlertManager having the responsibilities of capturing, grouping, inhibiting and silencing alerts. The AlertRouter will have the responsibilities of: enriching alerts, routing based on business rules, monitoring/guaranteeing delivery and enabling analysis of alert history.
Due to my requirements, I think I need something like the AlertRouter. The question is really, am I better to push from AlertManager to AlertRouter, or to have AlertRouter pull from AlertManager. My current opinion is that pulling comes with more benefits but since I've not seen anyone else doing this I'm concerned there could be good reasons (I'm not aware of) for not doing this.

If you really must have another system connected to Alertmanager, having it respond to webhook notifications would be the much simpler option. You'd still need to run multiple copies of your application behind a load balancer (and have a clustered database) for HA, but at least you'd not have the complexity of each instance having to discover all the Alertmanager instances, query them and then deduplicate amongst the different instances (again something that Alertmanager does itself already).
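
To give an idea of how little is needed on the receiving side, here is a minimal sketch of a stateless webhook receiver in Python, using only the standard library. It assumes the usual Alertmanager webhook JSON body with a top-level "alerts" list. The names parse_webhook, handle_alert and serve, and the port 9095, are made up for illustration and are not part of Alertmanager.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook(body: bytes) -> list:
    """Return the list of alerts from an Alertmanager webhook body."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload.get("alerts", [])


def handle_alert(alert: dict) -> None:
    # Hypothetical hook: put enrichment/forwarding logic here.
    print(alert.get("status"), alert.get("labels", {}))


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alerts = parse_webhook(self.rfile.read(length))
        except ValueError:
            # Malformed body (json.JSONDecodeError is a ValueError too).
            self.send_response(400)
            self.end_headers()
            return
        for alert in alerts:
            handle_alert(alert)
        # A 2xx response signals that the notification was accepted.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 9095) -> None:
    # Assumed port; point a webhook receiver's URL in Alertmanager at it.
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

The receiver keeps no state of its own, so several copies can sit behind a load balancer without having to coordinate with each other.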

I'm still struggling to see why you need an extra system at all - it feels very much like you'd be increasing complexity significantly, which naturally decreases reliability (more bits to break, have bugs or act in unexpected ways) and slows things down (as there is another "hop" for an alert to pass through). All of the things you mention can be done already through Alertmanager, or could be done pretty simply with a webhook receiver (without the need for any additional state storage, etc.)

* Adding data to an alert could be done with a simple webhook receiver that accepts an alert and then forwards it on to another API with extra information added (no need for any state)
* Routing can be done within Alertmanager, or for more complex cases could again be handled by a stateless webhook receiver
* With regards to "guaranteeing" delivery, I don't see your suggestion allowing that (I believe it would actually make that less likely overall due to the added complexity and likelihood of bugs/unhandled cases). Alertmanager already does a good job of retrying on errors (and updating metrics if that happens) but not much can be done if the final system is totally down for long periods of time (and for many systems if that happens old alerts aren't very useful once it is back, as they may have already resolved).
* Alertmanager and Prometheus already expose a number of useful metrics (make sure your Prometheus is scraping itself & all the connected Alertmanagers) which should give you lots of useful information about alert history (with the advantage of that data being in the monitoring system you already know [with whatever you have connected like dashboards, alerts, etc.])
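
The first two points could be sketched roughly as below: pure functions that add information to an alert and pick a destination, plus a small helper to forward the result. Everything here is a made-up illustration: the TEAM_INFO and ROUTES tables, the "team" and "severity" label names, and the example.com URLs are all assumptions and would be replaced by your own business rules.

```python
import json
import urllib.request

# Hypothetical lookup data; in practice this might be loaded from a file.
TEAM_INFO = {
    "payments": {
        "owner": "payments-oncall",
        "runbook": "https://example.com/runbooks/payments",
    },
}

# Hypothetical business rules: severity label -> destination API.
ROUTES = {
    "critical": "https://example.com/hooks/pager",
    "warning": "https://example.com/hooks/chat",
}


def enrich(alert: dict, team_info: dict = TEAM_INFO) -> dict:
    """Return a copy of the alert with extra annotations for its team."""
    team = alert.get("labels", {}).get("team")
    extra = team_info.get(team, {})
    enriched = dict(alert)
    enriched["annotations"] = {**alert.get("annotations", {}), **extra}
    return enriched


def route(alert: dict, default_url: str, routes: dict = ROUTES) -> str:
    """Pick a destination URL based on the alert's severity label."""
    severity = alert.get("labels", {}).get("severity")
    return routes.get(severity, default_url)


def forward(alert: dict, url: str, timeout: float = 5.0) -> int:
    """POST the alert as JSON to url and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(alert).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Since enrich and route depend only on their inputs, nothing needs to be stored between requests. If forward fails, the receiver can simply return an error and leave the retrying to Alertmanager.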


--
Stuart Clark

--
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/894f2f0c-2a8e-dc83-d4fa-cf4a1d605db9%40Jahingo.com.
