It sounds like you are planning on creating a fairly complex system that duplicates a reasonable amount of what Alertmanager already does. I'm presuming your diagram is a simplification and that the application is itself a cluster, so each instance would be querying each instance of Alertmanager? Would your storage be part of the clustering system (similar to Alertmanager) or another cluster of something like a relational database?
On 20 November 2021 11:28:30 GMT, Tony Di Nucci <[email protected]> wrote:

>There are other things I need to do as well (alert enrichment, complex routing, etc.), which means that I think some additional system is needed between AlertManager and the final destination in any case.
>
>The main question in my mind is really: are there reasons why I should prefer to have AlertManager push to this new system over having this new system pull?
>
>My reasons for preferring a pull-based architecture are:
>* Just by looking at the AlertRouter we can get a reasonable understanding of overall health. If alerts are pushed to the router, then it alone can't tell the difference between no alerts firing and it not receiving alerts that have fired.
>* Backpressure is a natural property of the system.
>
>With this extra context, what do you think?
>
>On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:
>
>> Thanks for the feedback.
>>
>> > What gives you the impression that the Alertmanager is "best effort"?
>>
>> Sorry, best-effort probably wasn't the right term to use. I am aware of there being retries; however, these could still all fail, and I'm thinking I wouldn't be made aware of the issue for potentially quite a long time.
>>
>> My understanding is that an *alertmanager_notification_requests_failed_total* counter will be incremented each time there is a failed send attempt; however, from this alone I can't tell the difference between a single alert that's consistently failing and a small number of alerts which are all failing. I think this means that I've got to wait until *alertmanager_notifications_failed_total* is incremented before considering an alert to have failed (and this can take many minutes), and then a bit of exploration is needed to figure out which alert(s) failed.
>> Depending on the criticality of the alert, it may be fine for it to take some minutes before we're made aware of a delivery problem; in other cases it won't be.
>>
>> A couple of things I didn't really touch on originally which will also help explain where my head is:
>> * I have a requirement to be able to measure accurate latency per alert through the alerting pipeline, i.e. for each alert I need to know the amount of time it was known to AlertManager before it was successfully written to the destination.
>> * I have a requirement to be able to analyse historic alerts.
>>
>> On Saturday, November 20, 2021 at 10:33:12 AM UTC [email protected] wrote:
>>
>>> Also, the Alertmanager does have an "event store"; it's a shared state between all instances.
>>>
>>> If you're interested in changing some of the behavior of the retry mechanisms or how this works, feel free to open specific issues. You don't need to build an entirely new system; we can add new features to the existing Alertmanager clustering framework.
>>>
>>> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie <[email protected]> wrote:
>>>
>>>> What gives you the impression that the Alertmanager is "best effort"?
>>>>
>>>> The Alertmanager provides a reasonably robust HA solution (gossip clustering). The only thing best-effort here is actually deduplication. The Alertmanager design is "at least once" delivery, so it's robust against network split-brain issues. So in the event of a failure, you may get duplicate alerts, not none.
>>>>
>>>> When it comes to delivery, the Alertmanager does have retries. If a connection to PagerDuty or other receivers has an issue, it will retry. There are also metrics for this, so you can alert on failures via alternate channels.
>>>>
>>>> What you likely need is a heartbeat setup. Because services like PagerDuty and Slack do have outages, you can't guarantee delivery if they're down.
>>>>
>>>> The method here is to have an end-to-end "always firing" heartbeat alert, which goes to a system/service like healthchecks.io or deadmanssnitch.com. These will trigger an alert in the absence of your heartbeat, letting you know that some part of the pipeline has failed.
>>>>
>>>> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci <[email protected]> wrote:
>>>>
>>>>> Cross-posted from https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>>>>
>>>>> In relation to alerting, I'm looking for a way to get strong alert delivery guarantees (and if delivery is not possible I want to know about it quickly).
>>>>>
>>>>> Unless I'm mistaken, AlertManager only offers best-effort delivery. What's puzzled me though is that I've not found anyone else speaking about this, so I worry I'm missing something obvious. Am I?
>>>>>
>>>>> Assuming I'm not mistaken, I've been thinking of building a system with the architecture shown below.
>>>>>
>>>>> [image: alertmanager-alertrouting.png]
>>>>>
>>>>> Basically, rather than having AlertManager try and push to destinations, I'd have an AlertRouter which polls AlertManager. On each polling cycle the steps would be (neglecting any optimisations):
>>>>>
>>>>> - All active alerts are fetched from AlertManager.
>>>>> - The last known set of active alerts is read from the Alert Event Store.
>>>>> - The set of active alerts is compared with the last known state.
>>>>> - New alerts are added to an "active" partition in the Alert Event Store.
>>>>> - Resolved alerts are removed from the "active" partition and added to a "resolved" partition.
>>>>>
>>>>> A secondary process within AlertRouter would:
>>>>>
>>>>> - Check for alerts in the "active" partition which do not have a state of "delivered = true".
>>>>> - Attempt to send each of these alerts and set the "delivered" flag.
>>>>> - Check for alerts in the "resolved" partition which do not have a state of "delivered = true".
>>>>> - Attempt to send each of these resolved alerts and set the "delivered" flag.
>>>>> - Move all alerts in the "resolved" partition where "delivered = true" to a "completed" partition.
>>>>>
>>>>> Among other metrics, the AlertRouter would emit one called "undelivered_alert_lowest_timestamp_in_seconds", which could be used to alert me to cases where an alert could not be delivered quickly enough. Since the alert is still held in the Alert Event Store, it should be possible for me to resolve whatever issue is blocking delivery without losing the alert.
>>>>>
>>>>> I think there are other benefits to this architecture too; e.g. similar to the way Prometheus scrapes, natural back-pressure is a property of the system.
>>>>>
>>>>> Anyway, as mentioned, I've not found anyone else doing something like this, and that makes me wonder if there's a very good reason not to. If anyone knows that this design is crazy, I'd love to hear!
>>>>>
>>>>> Thanks
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/be2f9bfd-ba4d-46ea-9816-f19ebef499d6n%40googlegroups.com.
>
>--
>You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/ded4a6d1-0218-403a-ba76-b982937053bbn%40googlegroups.com.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/02045E2A-C563-4C7D-991F-3969A786C0D4%40Jahingo.com.
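[Editor's note: the heartbeat setup Ben describes, and alerting on Alertmanager's own notification-failure metrics, can both be sketched as Prometheus alerting rules. A minimal, illustrative rule file follows — the alert names, labels, and thresholds are assumptions, not anything specified in the thread; the metric names are the real Alertmanager ones discussed above.]

```yaml
groups:
  - name: alerting-pipeline-meta
    rules:
      # Always-firing heartbeat. Route this in Alertmanager to an external
      # dead man's switch (e.g. healthchecks.io or deadmanssnitch.com),
      # which pages when the heartbeat *stops* arriving.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing alert; its absence means the pipeline is broken."

      # Page via an alternate channel when Alertmanager keeps failing to
      # deliver notifications.
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager is failing to deliver notifications."
```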
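[Editor's note: the AlertRouter polling cycle Tony outlines could be sketched roughly as below. This is a Python sketch under stated assumptions: Alertmanager's v2 HTTP API at an assumed local address, an in-memory dict standing in for the Alert Event Store, and a hypothetical `deliver` callable for the final destination.]

```python
import json
import time
import urllib.request

ALERTMANAGER_URL = "http://localhost:9093"  # assumed address


def fetch_active_alerts():
    """Fetch currently-active alerts from Alertmanager's v2 API."""
    url = f"{ALERTMANAGER_URL}/api/v2/alerts?active=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def poll_cycle(store, active_alerts, deliver):
    """One polling cycle: diff Alertmanager's view against the event store.

    store: dict with "active", "resolved" and "completed" partitions.
    active_alerts: list of alert dicts, each carrying a "fingerprint" key
    (as returned by the v2 API).
    deliver: callable sending one alert; returns True on success.
    """
    active = {a["fingerprint"]: a for a in active_alerts}

    # New alerts: firing in Alertmanager but not yet in the "active" partition.
    for fp, alert in active.items():
        if fp not in store["active"]:
            store["active"][fp] = {"alert": alert, "delivered": False,
                                   "first_seen": time.time()}

    # Resolved alerts: in the "active" partition but no longer firing.
    for fp in list(store["active"]):
        if fp not in active:
            entry = store["active"].pop(fp)
            entry["delivered"] = False  # the resolution itself must be delivered
            store["resolved"][fp] = entry

    # Secondary process: attempt delivery of anything not yet delivered,
    # then move delivered resolutions to the "completed" partition.
    for partition in ("active", "resolved"):
        for entry in store[partition].values():
            if not entry["delivered"]:
                entry["delivered"] = bool(deliver(entry["alert"]))
    for fp in [fp for fp, e in store["resolved"].items() if e["delivered"]]:
        store["completed"][fp] = store["resolved"].pop(fp)
```

The proposed "undelivered_alert_lowest_timestamp_in_seconds" metric then falls out of the stored state: it is `now - min(first_seen)` over all entries in the "active" and "resolved" partitions with `delivered == False`.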

