Thanks for the feedback, Stuart. I really appreciate you taking the time, 
and you've given me reason to pause and reconsider my options.

I fully understand your concerns over having a new data store.  I'm not 
sure that AlertManager and Prometheus contain the state I need though, and 
I'm not sure I should attempt to use Prometheus as the store for this state 
(tracking per-alert latencies would end up with a metric with unbounded 
cardinality, where each series would contain just a single data point, and 
it would be tricky to analyse this data).
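
To make the cardinality concern concrete, here is a rough Python sketch. It 
uses plain dictionaries rather than a Prometheus client, and the metric name 
alert_delivery_latency_seconds, the fingerprint label and the sample values 
are all made up for illustration. It only counts the series and samples you 
would end up with if every alert became its own label value:

```python
from collections import defaultdict

# Hypothetical observations: (alert fingerprint, delivery latency in seconds).
observations = [("a1f3", 2.4), ("9bc0", 31.0), ("77de", 5.2), ("c410", 0.9)]


def build_series(observations):
    """Group samples by label set, as a per-alert labelled metric would."""
    series = defaultdict(list)
    for fingerprint, latency in observations:
        # One label value per alert means one series per alert.
        key = ("alert_delivery_latency_seconds", ("fingerprint", fingerprint))
        series[key].append(latency)
    return series


series = build_series(observations)
series_count = len(series)
samples_per_series = {len(samples) for samples in series.values()}
print(series_count, samples_per_series)  # prints: 4 {1}
```

Every new alert adds a series that only ever holds one sample, so the series 
count grows with the number of alerts that have ever fired.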

On the "guaranteeing" delivery front, you of course have a point that the 
more moving parts there are, the more that can go wrong.  From the sounds of 
things, though, I don't think we're debating the need for another system 
(since this is what a webhook receiver would be?).  

Unless I'm mistaken, to hit the following requirements there'll need to be 
a system external to AlertManager, and this will have to maintain some state:
* support complex alert enrichment (in ways that cannot be defined in 
alerting rules)
* support business-specific alert routing rules (which are defined outside 
of alerting rules)
* support detailed alert analysis (which includes per-alert latencies)

I think this means that the question is limited to: is it better in my case 
to push or pull from AlertManager?  BTW, I'm sorry for the way I worded my 
original post because I now realise how important it was to make explicit 
the requirements that (I think) necessitate the majority of the complexity.

As I still see it, the problems with the push approach (which are not 
present with the pull approach) are:
* It's only possible to know that an alert cannot be delivered after 
waiting for *group_interval* (typically many minutes)
* At a given moment it's not possible to determine whether a specific 
active alert has been delivered (at least I'm not aware of a way to 
determine this)
* It is possible for alerts to be dropped 
(e.g. 
https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277)
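
With the pull approach, the second point becomes a simple lookup. A minimal 
sketch, assuming the router records the fingerprint of every alert it has 
delivered: the delivered set stands in for the shared database, the function 
name is hypothetical, and the alert dicts are only shaped like what I'd 
expect back from the AlertManager API (a fingerprint and a startsAt field):

```python
from datetime import datetime, timezone


def undelivered_alerts(active_alerts, delivered, now):
    """Return (fingerprint, seconds active) for alerts with no recorded delivery."""
    pending = []
    for alert in active_alerts:
        if alert["fingerprint"] in delivered:
            continue
        # Assumption: timestamps use an explicit "+00:00" offset. Real
        # responses may need a more tolerant RFC 3339 parser.
        started = datetime.fromisoformat(alert["startsAt"])
        pending.append((alert["fingerprint"], (now - started).total_seconds()))
    return pending


active = [
    {"fingerprint": "a1f3", "startsAt": "2021-11-22T10:00:00+00:00"},
    {"fingerprint": "9bc0", "startsAt": "2021-11-22T10:04:30+00:00"},
]
delivered = {"a1f3"}  # stand-in for the router's delivery records
now = datetime(2021, 11, 22, 10, 5, 0, tzinfo=timezone.utc)

pending = undelivered_alerts(active, delivered, now)
print(pending)  # prints: [('9bc0', 30.0)]
```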
 

The tradeoffs of the pull approach are:
* I'd need to discover the AlertManager instances.  This is pretty 
straightforward in k8s.
* I may need to dedupe alert groups across AlertManager instances.  I think 
this would be pretty straightforward too, esp. since AlertManager already 
populates fingerprints.
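
As a rough idea of what the dedupe step could look like, here is a sketch 
that merges the alerts pulled from each instance, keyed by fingerprint. The 
function name, the instance names and the sample data are hypothetical, and 
it assumes every alert returned by the API carries a fingerprint field:

```python
def dedupe_by_fingerprint(alerts_by_instance):
    """Merge alerts pulled from several AlertManager instances.

    alerts_by_instance maps an instance name to the list of alerts it
    returned. The first copy of each alert is kept, along with the names
    of all instances that reported it.
    """
    merged = {}
    for instance, alerts in alerts_by_instance.items():
        for alert in alerts:
            entry = merged.setdefault(
                alert["fingerprint"], {"alert": alert, "seen_on": []}
            )
            entry["seen_on"].append(instance)
    return merged


pulled = {
    "alertmanager-0": [{"fingerprint": "a1f3"}, {"fingerprint": "9bc0"}],
    "alertmanager-1": [{"fingerprint": "a1f3"}],
}
merged = dedupe_by_fingerprint(pulled)
print(sorted(merged))  # prints: ['9bc0', 'a1f3']
```

Keeping the list of instances that reported each alert would also make it 
possible to spot an alert that only some of the instances know about.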


 

On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:

> On 20/11/2021 23:42, Tony Di Nucci wrote:
> > Yes, the diagram is a bit of a simplification but not hugely.
> >
> > There may be multiple instances of AlertRouter however they will share 
> > a database.  Most likely things will be kept simple (at least 
> > initially) where each instance holds no state of its own.  Each active 
> > alert in the DB will be uniquely identified by the alert fingerprint 
> > (which the AlertManager API provides, i.e. a hash of the alert group's 
> > labels).  Each non-active alert will have a composite key (where one 
> > element is the alert group fingerprint).
> >
> > In this architecture I see AlertManager having the responsibilities of 
> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter 
> > will have the responsibilities of; enriching alerts, routing based on 
> > business rules, monitoring/guaranteeing delivery and enabling analysis 
> > of alert history.
> >
> > Due to my requirements, I think I need something like the 
> > AlertRouter.  The question is really, am I better to push from 
> > AlertManager to AlertRouter, or to have AlertRouter pull from 
> > AlertManager.  My current opinion is that pulling comes with more 
> > benefits but since I've not seen anyone else doing this I'm concerned 
> > there could be good reasons (I'm not aware of) for not doing this.
>
> If you really must have another system connected to Alertmanager having 
> it respond to webhook notifications would be the much simpler option. 
> You'd still need to run multiple copies of your application behind a load 
> balancer (and have a clustered database) for HA, but at least you'd not 
> have the complexity of each instance having to discover all the 
> Alertmanager instances, query them and then deduplicate amongst the 
> different instances (again something that Alertmanager does itself 
> already).
>
> I'm still struggling to see why you need an extra system at all - it 
> feels very much like you'd be increasing complexity significantly which 
> naturally decreases reliability (more bits to break, have bugs or act in 
> unexpected ways) and slows things down (as there is another "hop" for an 
> alert to pass through). All of the things you mention can be done 
> already through Alertmanager, or could be done pretty simply with a 
> webhook receiver (without the need for any additional state storage, etc.)
>
> * Adding data to an alert could be done with a simple webhook receiver, 
> that accepts an alert and then forwards it on to another API with extra 
> information added (no need for any state)
> * Routing can be done within Alertmanager, or for more complex cases 
> could again be handled by a stateless webhook receiver
> * With regards to "guaranteeing" delivery I don't see your suggestion in 
> allowing that (I believe it would actually make that less likely overall 
> due to the added complexity and likelihood of bugs/unhandled cases). 
> Alertmanager already does a good job of retrying on errors (and updating 
> metrics if that happens) but not much can be done if the final system is 
> totally down for long periods of time (and for many systems if that 
> happens old alerts aren't very useful once it is back, as they may have 
> already resolved).
> * Alertmanager and Prometheus already expose a number of useful metrics 
> (make sure your Prometheus is scraping itself & all the connected 
> Alertmanagers) which should give you lots of useful information about 
> alert history (with the advantage of that data being with the monitoring 
> system you already know [with whatever you have connected like 
> dashboards, alerts, etc.])
>
> -- 
> Stuart Clark
>
>
