Yes, the diagram is a simplification, but not a huge one. There may be multiple instances of AlertRouter, however they will share a database. Most likely things will be kept simple (at least initially), with each instance holding no state of its own. Each active alert in the DB will be uniquely identified by the alert fingerprint (which the AlertManager API provides, i.e. a hash of the alert group's labels). Each non-active alert will have a composite key (where one element is the alert group fingerprint).
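[Editor's note: to make the keying scheme above concrete, here is a minimal sketch. The hash function and field names are illustrative stand-ins, not the actual fingerprint algorithm Alertmanager uses internally.]

```python
import hashlib
import json

def group_fingerprint(group_labels: dict) -> str:
    """Illustrative stand-in for the fingerprint the Alertmanager API
    exposes: a stable hash over the sorted group labels."""
    canonical = json.dumps(sorted(group_labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def active_key(group_labels: dict) -> str:
    # Active alerts: uniquely identified by the fingerprint alone.
    return group_fingerprint(group_labels)

def resolved_key(group_labels: dict, resolved_at: str) -> tuple:
    # Non-active alerts: composite key, so repeated firings of the
    # same group do not collide with one another in the store.
    return (group_fingerprint(group_labels), resolved_at)
```

The point of the composite key is that the same alert group can fire and resolve many times; only the currently active instance needs to be unique by fingerprint alone.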
In this architecture I see AlertManager having the responsibilities of capturing, grouping, inhibiting and silencing alerts. The AlertRouter will have the responsibilities of enriching alerts, routing based on business rules, monitoring/guaranteeing delivery, and enabling analysis of alert history. Given my requirements, I think I need something like the AlertRouter. The real question is: am I better off pushing from AlertManager to AlertRouter, or having AlertRouter pull from AlertManager? My current opinion is that pulling comes with more benefits, but since I've not seen anyone else doing this I'm concerned there could be good reasons (that I'm not aware of) for not doing it.

On Saturday, November 20, 2021 at 5:38:06 PM UTC Stuart Clark wrote:

> It sounds like you are planning on creating a fairly complex system that
> duplicates a reasonable amount of what Alertmanager already does. I'm
> presuming your diagram is a simplification and that the application is
> itself a cluster, so each instance would be querying each instance of
> Alertmanager? Would your storage be part of the clustering system (similar
> to Alertmanager) or another cluster of something like a relational
> database?
>
> On 20 November 2021 11:28:30 GMT, Tony Di Nucci <[email protected]> wrote:
>>
>> There are other things I need to do as well (alert enrichment, complex
>> routing, etc.), which means that I think some additional system is needed
>> between AlertManager and the final destination in any case.
>>
>> The main question in my mind is really: are there reasons why I should
>> prefer to have AlertManager push to this new system over having this new
>> system pull?
>>
>> My reasons for preferring a pull-based architecture are:
>> * Just by looking at the AlertRouter we can get a reasonable
>> understanding of overall health. If alerts are pushed to the router then
>> it alone can't tell the difference between no alerts firing and it not
>> receiving alerts that have fired.
>> * Backpressure is a natural property of the system.
>>
>> With this extra context, what do you think?
>>
>> On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:
>>
>>> Thanks for the feedback.
>>>
>>> > What gives you the impression that the Alertmanager is "best effort"?
>>> Sorry, best-effort probably wasn't the right term to use. I am aware
>>> that there are retries, however these could still all fail and I'm
>>> thinking I wouldn't be made aware of the issue for potentially quite a
>>> long time.
>>>
>>> My understanding is that the
>>> *alertmanager_notification_requests_failed_total* counter is
>>> incremented each time there is a failed send attempt, however from this
>>> alone I can't tell the difference between a single alert that's
>>> consistently failing and a small number of alerts which are all failing.
>>> I think this means that I've got to wait until
>>> *alertmanager_notifications_failed_total* is incremented before
>>> considering an alert to have failed (and this can take many minutes),
>>> and then a bit of exploration is needed to figure out which alert(s)
>>> failed. Depending on the criticality of the alert it may be fine for it
>>> to take some minutes before we're made aware of a delivery problem; in
>>> other cases it won't be.
>>>
>>> A couple of things I didn't really touch on originally which will also
>>> help explain where my head is:
>>> * I have a requirement to be able to measure accurate latency per alert
>>> through the alerting pipeline, i.e. for each alert I need to know the
>>> amount of time it was known to AlertManager before it was successfully
>>> written to the destination.
>>> * I have a requirement to be able to analyse historic alerts.
>>>
>>> On Saturday, November 20, 2021 at 10:33:12 AM UTC [email protected]
>>> wrote:
>>>
>>>> Also, the Alertmanager does have an "event store"; it's a shared state
>>>> between all instances.
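[Editor's note: the two metrics discussed above can be alerted on directly. A minimal Prometheus rule sketch, with illustrative thresholds and names; as noted in the thread, this tells you *that* notifications are failing, not *which* alert failed.]

```yaml
groups:
  - name: alertmanager-meta
    rules:
      - alert: AlertmanagerNotificationsFailing
        # Fires when any integration has recorded failed notifications
        # (i.e. after retries were exhausted) over the last 5 minutes.
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager failed to deliver notifications via {{ $labels.integration }}"
```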
>>>>
>>>> If you're interested in changing some of the behavior of the retry
>>>> mechanisms or how this works, feel free to open specific issues. You
>>>> don't need to build an entirely new system; we can add new features to
>>>> the existing Alertmanager clustering framework.
>>>>
>>>> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie <[email protected]> wrote:
>>>>
>>>>> What gives you the impression that the Alertmanager is "best effort"?
>>>>>
>>>>> The Alertmanager provides a reasonably robust HA solution (gossip
>>>>> clustering). The only thing best-effort here is actually
>>>>> deduplication. The Alertmanager design is "at least once" delivery,
>>>>> so it's robust against network split-brain issues. So in the event of
>>>>> a failure, you may get duplicate alerts, not none.
>>>>>
>>>>> When it comes to delivery, the Alertmanager does have retries. If a
>>>>> connection to PagerDuty or other receivers has an issue, it will
>>>>> retry. There are also metrics for this, so you can alert on failures
>>>>> via alternate channels.
>>>>>
>>>>> What you likely need is a heartbeat setup. Because services like
>>>>> PagerDuty and Slack do have outages, you can't guarantee delivery if
>>>>> they're down.
>>>>>
>>>>> The method here is to have an end-to-end "always firing heartbeat"
>>>>> alert which goes to a system/service like healthchecks.io or
>>>>> deadmanssnitch.com. These will trigger an alert in the absence of
>>>>> your heartbeat, letting you know that some part of the pipeline has
>>>>> failed.
>>>>>
>>>>> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Cross-posted from
>>>>>> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>>>>>
>>>>>> In relation to alerting, I'm looking for a way to get strong alert
>>>>>> delivery guarantees (and if delivery is not possible I want to know
>>>>>> about it quickly).
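[Editor's note: the "always firing heartbeat" described above is commonly implemented as a rule whose expression is always true. A minimal sketch; the rule name and labels are illustrative.]

```yaml
groups:
  - name: meta
    rules:
      - alert: Watchdog
        # vector(1) always evaluates to a non-empty result, so this alert
        # fires continuously. Route it to a dead man's switch (e.g.
        # healthchecks.io or deadmanssnitch.com) that pages you when the
        # heartbeat *stops* arriving, proving the whole pipeline works
        # end to end.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing heartbeat for end-to-end alerting pipeline monitoring"
```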
>>>>>>
>>>>>> Unless I'm mistaken, AlertManager only offers best-effort delivery.
>>>>>> What's puzzled me though is that I've not found anyone else speaking
>>>>>> about this, so I worry I'm missing something obvious. Am I?
>>>>>>
>>>>>> Assuming I'm not mistaken, I've been thinking of building a system
>>>>>> with the architecture shown below.
>>>>>>
>>>>>> [image: alertmanager-alertrouting.png]
>>>>>>
>>>>>> Basically, rather than having AlertManager try to push to
>>>>>> destinations, I'd have an AlertRouter which polls AlertManager. On
>>>>>> each polling cycle the steps would be (neglecting any optimisations):
>>>>>>
>>>>>> - All active alerts are fetched from AlertManager.
>>>>>> - The last known set of active alerts is read from the Alert
>>>>>> Event Store.
>>>>>> - The set of active alerts is compared with the last known state.
>>>>>> - New alerts are added to an "active" partition in the Alert
>>>>>> Event Store.
>>>>>> - Resolved alerts are removed from the "active" partition and
>>>>>> added to a "resolved" partition.
>>>>>>
>>>>>> A secondary process within AlertRouter would:
>>>>>>
>>>>>> - Check for alerts in the "active" partition which do not have a
>>>>>> state of "delivered = true".
>>>>>> - Attempt to send each of these alerts and set the "delivered" flag.
>>>>>> - Check for alerts in the "resolved" partition which do not have
>>>>>> a state of "delivered = true".
>>>>>> - Attempt to send each of these resolved alerts and set the
>>>>>> "delivered" flag.
>>>>>> - Move all alerts in the "resolved" partition where
>>>>>> "delivered = true" to a "completed" partition.
>>>>>>
>>>>>> Among other metrics, the AlertRouter would emit one called
>>>>>> "undelivered_alert_lowest_timestamp_in_seconds", which could be used
>>>>>> to alert me to cases where any alert could not be delivered quickly
>>>>>> enough.
>>>>>> Since the alert is still held in the Alert Event Store, it should be
>>>>>> possible for me to resolve whatever issue is blocking delivery and
>>>>>> not lose the alert.
>>>>>>
>>>>>> I think there are other benefits to this architecture too, e.g.
>>>>>> similar to the way Prometheus scrapes, natural back-pressure is a
>>>>>> property of the system.
>>>>>>
>>>>>> Anyway, as mentioned, I've not found anyone else doing something like
>>>>>> this, and that makes me wonder if there's a very good reason not to.
>>>>>> If anyone knows that this design is crazy I'd love to hear!
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Prometheus Developers" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/prometheus-developers/be2f9bfd-ba4d-46ea-9816-f19ebef499d6n%40googlegroups.com
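[Editor's note: the polling cycle and secondary delivery process described in the original post can be sketched as below. This is a rough outline under stated assumptions: the in-memory `store` dict stands in for the Alert Event Store, `send` stands in for the destination client, and the Alertmanager address is a placeholder. The `/api/v2/alerts` endpoint and the `fingerprint` field do exist in Alertmanager's v2 API.]

```python
import json
import urllib.request

ALERTMANAGER_URL = "http://localhost:9093"  # assumed address

def fetch_active_alerts() -> dict:
    """Fetch currently active alerts from Alertmanager's v2 API,
    keyed by fingerprint."""
    url = f"{ALERTMANAGER_URL}/api/v2/alerts?active=true"
    with urllib.request.urlopen(url) as resp:
        alerts = json.load(resp)
    return {a["fingerprint"]: a for a in alerts}

def reconcile(active_now: dict, store: dict) -> None:
    """One polling cycle: diff Alertmanager's view against the store.
    `store` has 'active', 'resolved' and 'completed' partitions."""
    known = store["active"]
    # New alerts: seen by Alertmanager but not yet in the store.
    for fp, alert in active_now.items():
        if fp not in known:
            known[fp] = {"alert": alert, "delivered": False}
    # Resolved alerts: in the store but no longer active.
    for fp in list(known):
        if fp not in active_now:
            entry = known.pop(fp)
            entry["delivered"] = False  # the resolution must be delivered too
            store["resolved"][fp] = entry

def deliver_pending(store: dict, send) -> None:
    """Secondary process: push undelivered entries, then advance fully
    delivered resolutions to the 'completed' partition."""
    for partition in ("active", "resolved"):
        for entry in store[partition].values():
            if not entry["delivered"] and send(entry["alert"]):
                entry["delivered"] = True
    for fp in [f for f, e in store["resolved"].items() if e["delivered"]]:
        store["completed"][fp] = store["resolved"].pop(fp)
```

A real implementation would persist the store durably and also record timestamps per entry, so the "undelivered_alert_lowest_timestamp_in_seconds" metric from the post can be derived from the oldest entry with `delivered = False`.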

