On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci <[email protected]> wrote:
> Thanks for the feedback Stuart, I really appreciate you taking the time,
> and you've given me reason to pause and reconsider my options.
>
> I fully understand your concerns over having a new data store. I'm not
> sure that AlertManager and Prometheus contain the state I need though, and
> I'm not sure I should attempt to use Prometheus as the store for this state
> (tracking per-alert latencies would end up with a metric with unbounded
> cardinality, each series would contain just a single data point, and it
> would be tricky to analyse this data).
>
> On the "guaranteeing" delivery front, you of course have a point that the
> more moving parts there are, the more that can go wrong. From the sounds
> of things though, I don't think we're debating the need for another system
> (since this is what a webhook receiver would be?).
>
> Unless I'm mistaken, to hit the following requirements there'll need to be
> a system external to AlertManager, and it will have to maintain some state:
> * supporting complex alert enrichment (in ways that cannot be defined in
> alerting rules)

We actually are interested in adding this to the Alertmanager; there are a
few open proposals for this. Basically the idea is that you can make an
enrichment call at alert time to do things like grab metrics/dashboard
snapshots, other system state, etc.

> * support business-specific alert routing rules (which are defined outside
> of alerting rules)

The Alertmanager routing rules are pretty powerful already. Depending on
what you're interested in adding, this is something we could support
directly.

> * support detailed alert analysis (which includes per-alert latencies)

This is, IMO, more of a logging problem, and I think it's something we
could add. You ship the alert notifications to any kind of BI system you
like, ELK, etc. Maybe something to integrate into
https://github.com/yakshaving-art/alertsnitch.
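On the routing point, for anyone following along: a fair amount of business
routing can already be expressed in the existing configuration. A
hypothetical fragment (the team/severity labels, receiver names and URL
below are all made up for illustration):

```yaml
# Hypothetical Alertmanager routing fragment: route by a team label,
# send low-severity alerts to a webhook receiver feeding a log/BI
# pipeline, and fall through to a default receiver otherwise.
route:
  receiver: default-pager
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team = payments
      receiver: payments-pager
    - matchers:
        - severity = info
      receiver: bi-webhook

receivers:
  - name: default-pager
  - name: payments-pager
  - name: bi-webhook
    webhook_configs:
      - url: 'http://alert-logger.internal:8080/notify'
```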
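To make the logging suggestion concrete, here is a rough sketch of such a
receiver in Python (standard library only). It's illustrative rather than a
real project: the record fields and the idea of measuring latency as
notification arrival time minus startsAt are my own simplifications; the
payload fields used (alerts, status, labels, startsAt, fingerprint) are the
ones Alertmanager's webhook notification carries.

```python
# Hypothetical webhook receiver that turns Alertmanager notifications
# into JSON log lines for a BI/logging pipeline. Stateless: it only
# reads the notification payload and writes to stdout.
import json
import sys
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def to_log_records(payload: dict, now=None) -> list:
    """One flat record per alert, with a rough notification latency."""
    now = now or datetime.now(timezone.utc)
    records = []
    for alert in payload.get("alerts", []):
        starts_at = datetime.fromisoformat(
            alert["startsAt"].replace("Z", "+00:00"))
        records.append({
            "fingerprint": alert.get("fingerprint"),
            "alertname": alert.get("labels", {}).get("alertname"),
            "status": alert.get("status"),
            # Rough time from the alert firing to this notification
            # arriving at the receiver.
            "latency_seconds": (now - starts_at).total_seconds(),
        })
    return records


class LogReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for record in to_log_records(payload):
            sys.stdout.write(json.dumps(record) + "\n")
        self.send_response(200)
        self.end_headers()
```

Started with `HTTPServer(("", 8080), LogReceiver).serve_forever()` and
pointed at by a `webhook_configs` url, each notification becomes one JSON
line per alert that ELK or similar can ingest and aggregate.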
> I think this means that the question is limited to: is it better in my
> case to push or pull from AlertManager? BTW, I'm sorry for the way I
> worded my original post, because I now realise how important it was to
> make explicit the requirements that (I think) necessitate the majority
> of the complexity.

Honestly, most of what you want is stuff we could support in Alertmanager
without a lot of trouble, and these are things that other users would want
as well. Rather than build a whole new system, why not contribute
improvements directly to the Alertmanager?

> As I still see it, the problems with the push approach (which are not
> present with the pull approach) are:
> * It's only possible to know that an alert cannot be delivered after
> waiting for *group_interval* (typically many minutes)
> * At a given moment it's not possible to determine whether a specific
> active alert has been delivered (at least I'm not aware of a way to
> determine this)
> * It is possible for alerts to be dropped (e.g.
> https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277
> )
>
> The tradeoffs for this are:
> * I'd need to discover the AlertManager instances. This is pretty
> straightforward in k8s.
> * I may need to dedupe alert groups across AlertManager instances. I
> think this would be pretty straightforward too, esp. since AlertManager
> already populates fingerprints.
>
> On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:
>
>> On 20/11/2021 23:42, Tony Di Nucci wrote:
>> > Yes, the diagram is a bit of a simplification but not hugely.
>> >
>> > There may be multiple instances of AlertRouter however they will
>> > share a database. Most likely things will be kept simple (at least
>> > initially) where each instance holds no state of its own. Each
>> > active alert in the DB will be uniquely identified by the alert
>> > fingerprint (which the AlertManager API provides, i.e.
>> > a hash of the alert group's labels). Each non-active alert will
>> > have a composite key (where one element is the alert group
>> > fingerprint).
>> >
>> > In this architecture I see AlertManager having the responsibilities
>> > of capturing, grouping, inhibiting and silencing alerts. The
>> > AlertRouter will have the responsibilities of enriching alerts,
>> > routing based on business rules, monitoring/guaranteeing delivery,
>> > and enabling analysis of alert history.
>> >
>> > Due to my requirements, I think I need something like the
>> > AlertRouter. The question is really: am I better to push from
>> > AlertManager to AlertRouter, or to have AlertRouter pull from
>> > AlertManager? My current opinion is that pulling comes with more
>> > benefits, but since I've not seen anyone else doing this I'm
>> > concerned there could be good reasons (I'm not aware of) for not
>> > doing this.
>>
>> If you really must have another system connected to Alertmanager,
>> having it respond to webhook notifications would be the much simpler
>> option. You'd still need to run multiple copies of your application
>> behind a load balancer (and have a clustered database) for HA, but at
>> least you'd not have the complexity of each instance having to
>> discover all the Alertmanager instances, query them and then
>> deduplicate amongst the different instances (again, something that
>> Alertmanager does itself already).
>>
>> I'm still struggling to see why you need an extra system at all - it
>> feels very much like you'd be increasing complexity significantly,
>> which naturally decreases reliability (more bits to break, have bugs
>> or act in unexpected ways) and slows things down (as there is another
>> "hop" for an alert to pass through). All of the things you mention can
>> be done already through Alertmanager, or could be done pretty simply
>> with a webhook receiver (without the need for any additional state
>> storage, etc.)
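[As an aside, a rough Python sketch of what that discover-query-deduplicate
loop looks like. The instance URLs here stand in for whatever discovery
mechanism is used (e.g. a k8s headless service); `/api/v2/alerts` is
Alertmanager's own HTTP API, and `fingerprint` is a field it returns per
alert. Names and URLs are otherwise made up.]

```python
# Sketch of the "pull" side: fetch active alerts from every discovered
# Alertmanager replica and deduplicate them by fingerprint.
import json
from urllib.request import urlopen

# Hypothetical replica list; in k8s this might come from DNS discovery.
ALERTMANAGERS = ["http://am-0:9093", "http://am-1:9093"]


def fetch_alerts(base_url: str) -> list:
    """GET /api/v2/alerts from a single Alertmanager replica."""
    with urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)


def dedupe(alert_lists) -> dict:
    """Merge alerts from all replicas, keyed by fingerprint."""
    merged = {}
    for alerts in alert_lists:
        for alert in alerts:
            # Replicas gossip state, so the same alert (same fingerprint)
            # is expected on several of them; keep a single copy.
            merged.setdefault(alert["fingerprint"], alert)
    return merged
```

A poller would then call `dedupe(fetch_alerts(am) for am in ALERTMANAGERS)`
on an interval and compare the result against its own delivery records.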
>> * Adding data to an alert could be done with a simple webhook
>> receiver that accepts an alert and then forwards it on to another API
>> with extra information added (no need for any state)
>> * Routing can be done within Alertmanager, or for more complex cases
>> could again be handled by a stateless webhook receiver
>> * With regards to "guaranteeing" delivery, I don't see your suggestion
>> allowing that (I believe it would actually make that less likely
>> overall, due to the added complexity and likelihood of bugs/unhandled
>> cases). Alertmanager already does a good job of retrying on errors
>> (and updating metrics if that happens), but not much can be done if
>> the final system is totally down for long periods of time (and for
>> many systems, if that happens, old alerts aren't very useful once it
>> is back, as they may have already resolved).
>> * Alertmanager and Prometheus already expose a number of useful
>> metrics (make sure your Prometheus is scraping itself & all the
>> connected Alertmanagers) which should give you lots of useful
>> information about alert history (with the advantage of that data being
>> with the monitoring system you already know [with whatever you have
>> connected, like dashboards, alerts, etc.])
>>
>> --
>> Stuart Clark
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/db7dd8ec-e6a0-4054-acb8-b1b28278b2e2n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CABbyFmrFBe4zD0d5mYQHsngWaTVktq2SogbtKgivnQFOR9bFEA%40mail.gmail.com.

