On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci <[email protected]> wrote:

> Thanks for the feedback Stuart, I really appreciate you taking the time
> and you've given me reason to pause and reconsider my options.
>
> I fully understand your concerns over having a new data store.  I'm not
> sure that AlertManager and Prometheus contain the state I need though and
> I'm not sure I should attempt to use Prometheus as the store for this state
> (tracking per alert latencies would end up with a metric with unbounded
> cardinality, each series would just contain a single data point and it
> would be tricky to analyse this data).
>
> On the "guaranteeing" delivery front.  You of course have a point that the
> more moving parts there are the more that can go wrong.  From the sounds of
> things though I don't think we're debating the need for another system
> (since this is what a webhook receiver would be?).
>
> Unless I'm mistaken, to hit the following requirements there'll need to be
> a system external to AlertManager, and it will have to maintain some state:
> * supporting complex alert enrichment (in ways that cannot be defined in
> alerting rules)
>

We actually are interested in adding this to Alertmanager; there are a
few open proposals for it. Basically the idea is that you can make an
enrichment call at alert time to do things like grab metrics/dashboard
snapshots, other system state, etc.
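To make that concrete, here is a minimal sketch of the kind of enrichment hook being discussed. Everything in it is invented for illustration (the annotation name, the Grafana base URL, the `service` label); it is not an actual Alertmanager API, just the shape of the idea:

```python
import json

# Hypothetical enrichment step: attach a dashboard link derived from
# alert labels. The base URL and annotation key are made up.
GRAFANA_BASE = "https://grafana.example.com/d"  # assumed dashboard host

def enrich(alert: dict) -> dict:
    """Return a copy of the alert with extra annotations added."""
    labels = alert.get("labels", {})
    annotations = dict(alert.get("annotations", {}))
    if "service" in labels:
        annotations["dashboard"] = f"{GRAFANA_BASE}/{labels['service']}"
    enriched = dict(alert)
    enriched["annotations"] = annotations
    return enriched

alert = {"labels": {"alertname": "HighLatency", "service": "checkout"},
         "annotations": {"summary": "p99 latency above threshold"}}
print(json.dumps(enrich(alert)["annotations"], sort_keys=True))
```

The point is that the enrichment is a pure transformation on the alert payload, so it could run at notification time without any extra state store.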


> * support business specific alert routing rules (which are defined outside
> of alerting rules)
>

The alertmanager routing rules are pretty powerful already. Depending on
what you're interested in adding, this is something we could support
directly.
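For example, a routing tree with business-specific branches can already be expressed in the Alertmanager configuration. The receiver names, labels, and URL below are invented for illustration (the `matchers` syntax requires Alertmanager >= 0.22):

```yaml
route:
  receiver: default-pager
  group_by: ['alertname', 'cluster']
  routes:
    - matchers:
        - team = payments
        - severity = critical
      receiver: payments-pager
    - matchers:
        - severity =~ "warning|info"
      receiver: low-priority-tickets
receivers:
  - name: default-pager
    webhook_configs:
      - url: 'http://example.com/hook'
  - name: payments-pager
  - name: low-priority-tickets
```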


> * support detailed alert analysis (which includes per alert latencies)
>

This is, IMO, more of a logging problem, and something we could add: ship
the alert notifications to whatever BI/logging system you like, ELK, etc.

Maybe something to integrate into
https://github.com/yakshaving-art/alertsnitch.
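As a sketch of what the core of such a logging receiver could look like: the function below flattens one Alertmanager webhook payload (format v4) into one JSON line per alert, ready to ship to ELK or similar, with a crude per-alert latency (arrival time at the receiver minus `startsAt`). The payload field names follow the documented webhook format; the latency definition and all sample values are illustrative:

```python
import json
from datetime import datetime, timezone

def flatten(payload: dict, received_at: datetime) -> list[str]:
    """Turn one Alertmanager webhook payload (v4) into per-alert JSON log lines."""
    lines = []
    for alert in payload.get("alerts", []):
        started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        record = {
            "fingerprint": alert.get("fingerprint"),
            "status": alert["status"],
            "labels": alert["labels"],
            "receiver": payload.get("receiver"),
            "groupKey": payload.get("groupKey"),
            # crude per-alert delivery latency: wall-clock arrival at the
            # receiver minus the alert's start time
            "latency_seconds": (received_at - started).total_seconds(),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines

payload = {
    "version": "4", "receiver": "log", "status": "firing",
    "groupKey": '{}:{alertname="HighLatency"}',
    "alerts": [{"status": "firing", "fingerprint": "b0c4b62a7e685d2f",
                "labels": {"alertname": "HighLatency"},
                "annotations": {}, "startsAt": "2021-11-22T15:58:03Z",
                "endsAt": "0001-01-01T00:00:00Z"}],
}
now = datetime(2021, 11, 22, 16, 3, 3, tzinfo=timezone.utc)
for line in flatten(payload, now):
    print(line)
```

Once the notifications are flat JSON lines, any downstream system can index them, which sidesteps the unbounded-cardinality problem of tracking per-alert latencies as Prometheus metrics.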


>
> I think this means that the question is limited to: is it better in my
> case to push or pull from AlertManager?  BTW, I'm sorry for the way I
> worded my original post because I now realise how important it was to make
> explicit the requirements that (I think) necessitate the majority of the
> complexity.
>

Honestly, most of what you want is stuff we could support in Alertmanager
without a lot of trouble, and these are things that other users would want
as well. Rather than build a whole new system, why not contribute
improvements directly to the Alertmanager?


>
> As I still see it, the problems with the push approach (which are not
> present with the pull approach) are:
> * It's only possible to know that an alert cannot be delivered after
> waiting for *group_interval* (typically many minutes)
> * At a given moment it's not possible to determine whether a specific
> active alert has been delivered (at least I'm not aware of a way to
> determine this)
> * It is possible for alerts to be dropped (e.g.
> https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277
> )
>
> The tradeoffs for this are:
> * I'd need to discover the AlertManager instances.  This is pretty
> straightforward in k8s.
> * I may need to dedupe alert groups across AlertManager instances.  I
> think this would be pretty straightforward too, esp. since AlertManager
> already populates fingerprints.
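For what it's worth, the fingerprint-based dedup described above reduces to a pure merge over the `/api/v2/alerts` responses from each replica (HTTP fetching omitted; `fingerprint` and `updatedAt` are fields of the v2 alert object, but the replica data below is made up):

```python
def dedupe(responses: list[list[dict]]) -> dict[str, dict]:
    """Merge /api/v2/alerts responses from several Alertmanager replicas,
    keeping one copy per fingerprint (the most recently updated one)."""
    merged: dict[str, dict] = {}
    for alerts in responses:
        for alert in alerts:
            fp = alert["fingerprint"]
            cur = merged.get(fp)
            # RFC 3339 timestamps in the same offset compare lexicographically
            if cur is None or alert["updatedAt"] > cur["updatedAt"]:
                merged[fp] = alert
    return merged

am_a = [{"fingerprint": "abc", "updatedAt": "2021-11-22T16:00:00Z"}]
am_b = [{"fingerprint": "abc", "updatedAt": "2021-11-22T16:02:00Z"},
        {"fingerprint": "def", "updatedAt": "2021-11-22T16:01:00Z"}]
merged = dedupe([am_a, am_b])
print(sorted(merged))  # ['abc', 'def']
```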
>
>
>
>
> On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:
>
>> On 20/11/2021 23:42, Tony Di Nucci wrote:
>> > Yes, the diagram is a bit of a simplification but not hugely.
>> >
>> > There may be multiple instances of AlertRouter however they will share
>> > a database.  Most likely things will be kept simple (at least
>> > initially) where each instance holds no state of its own.  Each active
>> > alert in the DB will be uniquely identified by the alert fingerprint
>> > (which the AlertManager API provides, i.e. a hash of the alert group's
>> > labels).  Each non-active alert will have a composite key (where one
>> > element is the alert group fingerprint).
>> >
>> > In this architecture I see AlertManager having the responsibilities of
>> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter
>> > will have the responsibilities of: enriching alerts, routing based on
>> > business rules, monitoring/guaranteeing delivery and enabling analysis
>> > of alert history.
>> >
>> > Due to my requirements, I think I need something like the
>> > AlertRouter.  The question is really, am I better to push from
>> > AlertManager to AlertRouter, or to have AlertRouter pull from
>> > AlertManager.  My current opinion is that pulling comes with more
>> > benefits but since I've not seen anyone else doing this I'm concerned
>> > there could be good reasons (I'm not aware of) for not doing this.
>>
>> If you really must have another system connected to Alertmanager having
>> it respond to webhook notifications would be the much simpler option.
>> You'd still need to run multiple copies of your application behind a load
>> balancer (and have a clustered database) for HA, but at least you'd not
>> have the complexity of each instance having to discover all the
>> Alertmanager instances, query them and then deduplicate amongst the
>> different instances (again something that Alertmanager does itself
>> already).
>>
>> I'm still struggling to see why you need an extra system at all - it
>> feels very much like you'd be increasing complexity significantly which
>> naturally decreases reliability (more bits to break, have bugs or act in
>> unexpected ways) and slow things down (as there is another "hop" for an
>> alert to pass through). All of the things you mention can be done
>> already through Alertmanager, or could be done pretty simply with a
>> webhook receiver (without the need for any additional state storage,
>> etc.)
>>
>> * Adding data to an alert could be done with a simple webhook receiver,
>> that accepts an alert and then forwards it on to another API with extra
>> information added (no need for any state)
>> * Routing can be done within Alertmanager, or for more complex cases
>> could again be handled by a stateless webhook receiver
>> * With regards to "guaranteeing" delivery, I don't see how your suggestion
>> allows that (I believe it would actually make delivery less likely overall
>> due to the added complexity and likelihood of bugs/unhandled cases).
>> Alertmanager already does a good job of retrying on errors (and updating
>> metrics if that happens) but not much can be done if the final system is
>> totally down for long periods of time (and for many systems if that
>> happens old alerts aren't very useful once it is back, as they may have
>> already resolved).
>> * Alertmanager and Prometheus already expose a number of useful metrics
>> (make sure your Prometheus is scraping itself & all the connected
>> Alertmanagers) which should give you lots of useful information about
>> alert history (with the advantage of that data being with the monitoring
>> system you already know [with whatever you have connected like
>> dashboards, alerts, etc.])
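As a concrete example of that last point, the notification metrics Alertmanager already exposes can answer delivery questions directly; for instance, in PromQL (assuming all Alertmanager replicas are scraped):

```promql
# Notification failure ratio per integration over the last hour
sum by (integration) (rate(alertmanager_notifications_failed_total[1h]))
/
sum by (integration) (rate(alertmanager_notifications_total[1h]))

# 90th percentile notification latency per integration
histogram_quantile(0.9,
  sum by (integration, le) (rate(alertmanager_notification_latency_seconds_bucket[1h])))
```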
>>
>> --
>> Stuart Clark
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/db7dd8ec-e6a0-4054-acb8-b1b28278b2e2n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-developers/db7dd8ec-e6a0-4054-acb8-b1b28278b2e2n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
