It sounds like you are planning on creating a fairly complex system that duplicates a reasonable amount of what Alertmanager already does. I'm presuming your diagram is a simplification and that the application is itself a cluster, so each instance would be querying each instance of Alertmanager? Would your storage be part of the clustering system (similar to Alertmanager) or another cluster of something like a relational database?
On 20 November 2021 11:28:30 GMT, Tony Di Nucci <[email protected]> wrote:

>There are other things I need to do as well (alert enrichment, complex routing, etc.), which means that I think some additional system is needed between AlertManager and the final destination in any case.
>
>The main question in my mind is really: are there reasons why I should prefer to have AlertManager push to this new system over having this new system pull?
>
>My reasons for preferring a pull-based architecture are:
>* Just by looking at the AlertRouter we can get a reasonable understanding of overall health. If alerts are pushed to the router, then it alone can't tell the difference between no alerts firing and it not receiving alerts that have fired.
>* Backpressure is a natural property of the system.
>
>With this extra context, what do you think?
>
>On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:
>
>> Thanks for the feedback.
>>
>> > What gives you the impression that the Alertmanager is "best effort"?
>>
>> Sorry, best-effort probably wasn't the right term to use. I am aware of there being retries; however, these could still all fail, and I'm thinking I wouldn't be made aware of the issue for potentially quite a long time.
>>
>> My understanding is that an *alertmanager_notification_requests_failed_total* counter will be incremented each time there is a failed send attempt; however, from this alone I can't tell the difference between a single alert that's consistently failing and a small number of alerts which are all failing. I think this means that I've got to wait until *alertmanager_notifications_failed_total* is incremented before considering an alert to have failed (and this can take many minutes), and then a bit of exploration is needed to figure out which alert(s) failed.
>> Depending on the criticality of the alert, it may be fine for it to take some minutes before we're made aware of a delivery problem; in other cases it won't be.
>>
>> A couple of things I didn't really touch on originally which will also help explain where my head is:
>> * I have a requirement to be able to measure accurate latency per alert through the alerting pipeline, i.e. for each alert I need to know the amount of time it was known to AlertManager before it was successfully written to the destination.
>> * I have a requirement to be able to analyse historic alerts.
>>
>> On Saturday, November 20, 2021 at 10:33:12 AM UTC [email protected] wrote:
>>
>>> Also, the Alertmanager does have an "event store"; it's a shared state between all instances.
>>>
>>> If you're interested in changing some of the behavior of the retry mechanisms or how this works, feel free to open specific issues. You don't need to build an entirely new system; we can add new features to the existing Alertmanager clustering framework.
>>>
>>> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie <[email protected]> wrote:
>>>
>>>> What gives you the impression that the Alertmanager is "best effort"?
>>>>
>>>> The Alertmanager provides a reasonably robust HA solution (gossip clustering). The only thing best-effort here is actually deduplication. The Alertmanager design is "at least once" delivery, so it's robust against network split-brain issues. So in the event of a failure, you may get duplicate alerts, not none.
>>>>
>>>> When it comes to delivery, the Alertmanager does have retries. If a connection to PagerDuty or other receivers has an issue, it will retry. There are also metrics for this, so you can alert on failures via alternate channels.
>>>>
>>>> What you likely need is a heartbeat setup. Because services like PagerDuty and Slack do have outages, you can't guarantee delivery if they're down.
>>>>
>>>> The method here is to have an end-to-end "always firing" heartbeat alert, which goes to a system/service like healthchecks.io or deadmanssnitch.com. These will trigger an alert in the absence of your heartbeat, letting you know that some part of the pipeline has failed.
>>>>
>>>> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci <[email protected]> wrote:
>>>>
>>>>> Cross-posted from https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>>>>
>>>>> In relation to alerting, I'm looking for a way to get strong alert delivery guarantees (and if delivery is not possible I want to know about it quickly).
>>>>>
>>>>> Unless I'm mistaken, AlertManager only offers best-effort delivery. What's puzzled me though is that I've not found anyone else speaking about this, so I worry I'm missing something obvious. Am I?
>>>>>
>>>>> Assuming I'm not mistaken, I've been thinking of building a system with the architecture shown below.
>>>>>
>>>>> [image: alertmanager-alertrouting.png]
>>>>>
>>>>> Basically, rather than having AlertManager try and push to destinations, I'd have an AlertRouter which polls AlertManager. On each polling cycle the steps would be (neglecting any optimisations):
>>>>>
>>>>> - All active alerts are fetched from AlertManager.
>>>>> - The last known set of active alerts is read from the Alert Event Store.
>>>>> - The set of active alerts is compared with the last known state.
>>>>> - New alerts are added to an "active" partition in the Alert Event Store.
>>>>> - Resolved alerts are removed from the "active" partition and added to a "resolved" partition.
>>>>>
>>>>> A secondary process within AlertRouter would:
>>>>>
>>>>> - Check for alerts in the "active" partition which do not have a state of "delivered = true".
>>>>> - Attempt to send each of these alerts and set the "delivered" flag.
>>>>> - Check for alerts in the "resolved" partition which do not have a state of "delivered = true".
>>>>> - Attempt to send each of these resolved alerts and set the "delivered" flag.
>>>>> - Move all alerts in the "resolved" partition where "delivered = true" to a "completed" partition.
>>>>>
>>>>> Among other metrics, the AlertRouter would emit one called "undelivered_alert_lowest_timestamp_in_seconds", which could be used to alert me to cases where an alert could not be delivered quickly enough. Since the alert is still held in the Alert Event Store, it should be possible for me to resolve whatever issue is blocking delivery without losing the alert.
>>>>>
>>>>> I think there are other benefits to this architecture too; e.g. similar to the way Prometheus scrapes, natural back-pressure is a property of the system.
>>>>>
>>>>> Anyway, as mentioned, I've not found anyone else doing something like this, and that makes me wonder if there's a very good reason not to. If anyone knows that this design is crazy, I'd love to hear!
>>>>>
>>>>> Thanks
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/be2f9bfd-ba4d-46ea-9816-f19ebef499d6n%40googlegroups.com.
>
>--
>You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/ded4a6d1-0218-403a-ba76-b982937053bbn%40googlegroups.com.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/02045E2A-C563-4C7D-991F-3969A786C0D4%40Jahingo.com.
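[Editor's note: the heartbeat setup Ben describes, and alerting on Alertmanager's own notification-failure metrics, can both be sketched as Prometheus alerting rules. A minimal, illustrative rule file follows — the alert names, labels, and thresholds are assumptions, not anything specified in the thread; the metric names are the real Alertmanager ones discussed above.]

```yaml
groups:
  - name: alerting-pipeline-meta
    rules:
      # Always-firing heartbeat. Route this in Alertmanager to an external
      # dead man's switch (e.g. healthchecks.io or deadmanssnitch.com),
      # which pages when the heartbeat *stops* arriving.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing alert; its absence means the pipeline is broken."

      # Page via an alternate channel when Alertmanager keeps failing to
      # deliver notifications.
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager is failing to deliver notifications."
```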
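[Editor's note: the AlertRouter polling cycle Tony outlines could be sketched roughly as below. This is a Python sketch under stated assumptions: Alertmanager's v2 HTTP API at an assumed local address, an in-memory dict standing in for the Alert Event Store, and a hypothetical `deliver` callable for the final destination.]

```python
import json
import time
import urllib.request

ALERTMANAGER_URL = "http://localhost:9093"  # assumed address


def fetch_active_alerts():
    """Fetch currently-active alerts from Alertmanager's v2 API."""
    url = f"{ALERTMANAGER_URL}/api/v2/alerts?active=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def poll_cycle(store, active_alerts, deliver):
    """One polling cycle: diff Alertmanager's view against the event store.

    store: dict with "active", "resolved" and "completed" partitions.
    active_alerts: list of alert dicts, each carrying a "fingerprint" key
    (as returned by the v2 API).
    deliver: callable sending one alert; returns True on success.
    """
    active = {a["fingerprint"]: a for a in active_alerts}

    # New alerts: firing in Alertmanager but not yet in the "active" partition.
    for fp, alert in active.items():
        if fp not in store["active"]:
            store["active"][fp] = {"alert": alert, "delivered": False,
                                   "first_seen": time.time()}

    # Resolved alerts: in the "active" partition but no longer firing.
    for fp in list(store["active"]):
        if fp not in active:
            entry = store["active"].pop(fp)
            entry["delivered"] = False  # the resolution itself must be delivered
            store["resolved"][fp] = entry

    # Secondary process: attempt delivery of anything not yet delivered,
    # then move delivered resolutions to the "completed" partition.
    for partition in ("active", "resolved"):
        for entry in store[partition].values():
            if not entry["delivered"]:
                entry["delivered"] = bool(deliver(entry["alert"]))
    for fp in [fp for fp, e in store["resolved"].items() if e["delivered"]]:
        store["completed"][fp] = store["resolved"].pop(fp)
```

The proposed "undelivered_alert_lowest_timestamp_in_seconds" metric then falls out of the stored state: it is `now - min(first_seen)` over all entries in the "active" and "resolved" partitions with `delivered == False`.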

