Thanks for the feedback, Stuart. I really appreciate you taking the time, and you've given me reason to pause and reconsider my options.
I fully understand your concerns over having a new data store. I'm not sure that AlertManager and Prometheus contain the state I need, though, and I'm not sure I should attempt to use Prometheus as the store for this state: tracking per-alert latencies would end up with a metric with unbounded cardinality, each series would contain just a single data point, and it would be tricky to analyse this data.

On the "guaranteeing" delivery front, you of course have a point that the more moving parts there are, the more can go wrong. From the sounds of things, though, I don't think we're debating the need for another system (since this is what a webhook receiver would be?). Unless I'm mistaken, to hit the following requirements there will need to be a system external to AlertManager, and it will have to maintain some state:

* supporting complex alert enrichment (in ways that cannot be defined in alerting rules)
* supporting business-specific alert routing rules (which are defined outside of alerting rules)
* supporting detailed alert analysis (which includes per-alert latencies)

I think this means the question is limited to: in my case, is it better to push to or pull from AlertManager?

BTW, I'm sorry for the way I worded my original post, because I now realise how important it was to make explicit the requirements that (I think) necessitate the majority of the complexity.

As I still see it, the problems with the push approach (which are not present with the pull approach) are:

* It's only possible to know that an alert cannot be delivered after waiting for group_interval (typically many minutes)
* At a given moment it's not possible to determine whether a specific active alert has been delivered (at least I'm not aware of a way to determine this)
* It is possible for alerts to be dropped (e.g.
https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277)

The tradeoffs for this are:

* I'd need to discover the AlertManager instances. This is pretty straightforward in k8s.
* I may need to dedupe alert groups across AlertManager instances. I think this would be pretty straightforward too, especially since AlertManager already populates fingerprints.

On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:

> On 20/11/2021 23:42, Tony Di Nucci wrote:
> > Yes, the diagram is a bit of a simplification but not hugely.
> >
> > There may be multiple instances of AlertRouter however they will share a database. Most likely things will be kept simple (at least initially) where each instance holds no state of its own. Each active alert in the DB will be uniquely identified by the alert fingerprint (which the AlertManager API provides, i.e. a hash of the alert groups labels). Each non-active alert will have a composite key (where one element is the alert group fingerprint).
> >
> > In this architecture I see AlertManager having the responsibilities of capturing, grouping, inhibiting and silencing alerts. The AlertRouter will have the responsibilities of; enriching alerts, routing based on business rules, monitoring/guaranteeing delivery and enabling analysis of alert history.
> >
> > Due to my requirements, I think I need something like the AlertRouter. The question is really, am I better to push from AlertManager to AlertRouter, or to have AlertRouter pull from AlertManager. My current opinion is that pulling comes with more benefits but since I've not seen anyone else doing this I'm concerned there could be good reasons (I'm not aware of) for not doing this.
>
> If you really must have another system connected to Alertmanager having it respond to webhook notifications would be the much simpler option.
> You'd still need to run multiple copies of you application behind a load balancer (and have a clustered database) for HA, but at least you'd not have the complexity of each instance having to discover all the Alertmanager instances, query them and then deduplicate amongst the different instances (again something that Alertmanager does itself already).
>
> I'm still struggling to see why you need an extra system at all - it feels very much like you'd be increasing complexity significantly which naturally decreases reliability (more bits to break, have bugs or act in unexpected ways) and slow things down (as there is another "hop" for an alert to pass through). All of the things you mention can be done already through Alertmanager, or could be done pretty simply with a webhook receiver (without the need for any additional state storage, etc.)
>
> * Adding data to an alert could be done with a simple webhook receiver, that accepts an alert and then forwards it on to another API with extra information added (no need for any state)
> * Routing can be done within Alertmanager, or for more complex cases could again be handled by a stateless webhook receiver
> * With regards to "guaranteeing" delivery I don't see your suggestion in allowing that (I believe it would actually make that less likely overall due to the added complexity and likelihood of bugs/unhandled cases). Alertmanager already does a good job of retrying on errors (and updating metrics if that happens) but not much can be done if the final system is totally down for long periods of time (and for many systems if that happens old alerts aren't very useful once it is back, as they may have already resolved).
> * Alertmanager and Prometheus already expose a number of useful metrics (make sure your Prometheus is scraping itself & all the connected Alertmanagers) which should give you lots of useful information about alert history (with the advantage of that data being with the monitoring system you already know [with whatever you have connected like dashboards, alerts, etc.])
>
> --
> Stuart Clark
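FWIW, to make the pull + dedupe side concrete, here's roughly the shape I have in mind. This is only a minimal sketch: the function names and the surrounding AlertRouter wiring are my own invention, and the only parts taken from AlertManager itself are the v2 API's `/api/v2/alerts` endpoint and the `fingerprint` field it returns on each alert.

```python
# Minimal sketch of pulling active alerts from several AlertManager replicas
# and collapsing duplicates by fingerprint. Function names and the calling
# pattern are hypothetical; /api/v2/alerts and the "fingerprint" field are
# from the AlertManager v2 API.
import json
from urllib.request import urlopen


def fetch_active_alerts(base_url):
    """GET the currently active alerts from one AlertManager replica."""
    with urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)


def dedupe_by_fingerprint(alert_lists):
    """Merge per-replica alert lists into one set of unique alerts.

    AlertManager replicas gossip alert state to each other, so the same
    alert appears on every instance with the same fingerprint; keying a
    dict on that fingerprint collapses the duplicates.
    """
    merged = {}
    for alerts in alert_lists:
        for alert in alerts:
            merged[alert["fingerprint"]] = alert
    return list(merged.values())


def pull_alerts(instance_urls):
    """Poll every discovered replica and return the deduplicated alerts."""
    return dedupe_by_fingerprint(fetch_active_alerts(u) for u in instance_urls)
```

Discovering `instance_urls` would be the k8s part (e.g. resolving the AlertManager headless service), and the deduped output is what the AlertRouter would then enrich, route and track.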

