Also, the Alertmanager does have an "event store": it's shared state between all instances.
If you're interested in changing some of the behavior of the retry mechanisms or how this works, feel free to open specific issues. You don't need to build an entirely new system; we can add new features to the existing Alertmanager clustering framework.

On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie <[email protected]> wrote:

> What gives you the impression that the Alertmanager is "best effort"?
>
> The Alertmanager provides a reasonably robust HA solution (gossip
> clustering). The only thing best-effort here is actually deduplication. The
> Alertmanager design is "at least once" delivery, so it's robust against
> network split-brain issues. So in the event of a failure, you may get
> duplicate alerts, not none.
>
> When it comes to delivery, the Alertmanager does have retries. If a
> connection to PagerDuty or another receiver has an issue, it will retry.
> There are also metrics for this, so you can alert on failures via alternate
> channels.
>
> What you likely need is a heartbeat setup. Because services like PagerDuty
> and Slack do have outages, you can't guarantee delivery if they're down.
>
> The method here is to have an end-to-end "always firing heartbeat" alert,
> which goes to a system/service like healthchecks.io or deadmanssnitch.com.
> These will trigger an alert in the absence of your heartbeat, letting you
> know that some part of the pipeline has failed.
>
> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci <[email protected]> wrote:
>
>> Cross-posted from
>> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>
>> In relation to alerting, I'm looking for a way to get strong alert
>> delivery guarantees (and if delivery is not possible I want to know about
>> it quickly).
>>
>> Unless I'm mistaken, Alertmanager only offers best-effort delivery. What's
>> puzzled me, though, is that I've not found anyone else speaking about this,
>> so I worry I'm missing something obvious. Am I?
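[Editor's note: the heartbeat setup described above is commonly implemented as an always-firing Prometheus rule, plus an alert on Alertmanager's own notification-failure metrics. A minimal sketch; the group, alert, and severity names here are placeholders, while `vector(1)` and `alertmanager_notifications_failed_total` are real:]

```yaml
groups:
  - name: meta-alerts
    rules:
      # Fires continuously; its *absence* at the heartbeat service
      # (e.g. healthchecks.io, deadmanssnitch.com) means some part of
      # the Prometheus -> Alertmanager -> receiver pipeline is broken.
      - alert: Heartbeat
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Always-firing end-to-end heartbeat

      # Alert via an alternate channel when notification delivery fails,
      # using Alertmanager's own self-monitoring metrics.
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        labels:
          severity: critical
```

[In Alertmanager, the `Heartbeat` alert would then be routed to a receiver pointing at the heartbeat service, which pages when the heartbeat stops arriving.]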
>>
>> Assuming I'm not mistaken, I've been thinking of building a system with
>> the architecture shown below.
>>
>> [image: alertmanager-alertrouting.png]
>>
>> Basically, rather than having Alertmanager try to push to destinations,
>> I'd have an AlertRouter which polls Alertmanager. On each polling cycle
>> the steps would be (neglecting any optimisations):
>>
>> - All active alerts are fetched from Alertmanager.
>> - The last known set of active alerts is read from the Alert Event Store.
>> - The set of active alerts is compared with the last known state.
>> - New alerts are added to an "active" partition in the Alert Event Store.
>> - Resolved alerts are removed from the "active" partition and added to a
>>   "resolved" partition.
>>
>> A secondary process within AlertRouter would:
>>
>> - Check for alerts in the "active" partition which do not have a state of
>>   "delivered = true".
>> - Attempt to send each of these alerts and set the "delivered" flag.
>> - Check for alerts in the "resolved" partition which do not have a state
>>   of "delivered = true".
>> - Attempt to send each of these resolved alerts and set the "delivered" flag.
>> - Move all alerts in the "resolved" partition where "delivered = true" to
>>   a "completed" partition.
>>
>> Among other metrics, the AlertRouter would emit one called
>> "undelivered_alert_lowest_timestamp_in_seconds", which could be used to
>> alert me to cases where an alert could not be delivered quickly enough.
>> Since the alert is still held in the Alert Event Store, it should be
>> possible for me to resolve whatever issue is blocking delivery without
>> losing the alert.
>>
>> I think there are other benefits to this architecture too, e.g. similar
>> to the way Prometheus scrapes, natural back-pressure is a property of the
>> system.
>>
>> Anyway, as mentioned, I've not found anyone else doing something like
>> this, which makes me wonder if there's a very good reason not to.
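[Editor's note: the polling and delivery cycles described above can be sketched as an in-memory reconciliation loop. This is illustrative only; all function and partition names are hypothetical, a real implementation would fetch alerts from Alertmanager's `GET /api/v2/alerts` endpoint and back the store with durable storage:]

```python
import time

def alert_key(alert):
    """An alert's identity is its label set."""
    return tuple(sorted(alert["labels"].items()))

def reconcile(active_alerts, store):
    """Polling cycle: compare Alertmanager's view with the event store."""
    seen = {alert_key(a) for a in active_alerts}
    # New alerts enter the "active" partition, marked undelivered.
    for a in active_alerts:
        k = alert_key(a)
        if k not in store["active"]:
            store["active"][k] = {"alert": a, "delivered": False,
                                  "first_seen": time.time()}
    # Alerts no longer reported move from "active" to "resolved";
    # the resolution itself must also be delivered.
    for k in list(store["active"]):
        if k not in seen:
            entry = store["active"].pop(k)
            entry["delivered"] = False
            store["resolved"][k] = entry

def deliver(store, send):
    """Secondary process: push undelivered entries, then archive."""
    for partition in ("active", "resolved"):
        for entry in store[partition].values():
            if not entry["delivered"] and send(entry["alert"]):
                entry["delivered"] = True
    # Delivered resolutions move to "completed".
    for k in list(store["resolved"]):
        if store["resolved"][k]["delivered"]:
            store["completed"][k] = store["resolved"].pop(k)

def undelivered_alert_lowest_timestamp_in_seconds(store):
    """Gauge for alerting on stuck deliveries: oldest undelivered
    timestamp, or 0 when nothing is pending."""
    pending = [e["first_seen"]
               for p in ("active", "resolved")
               for e in store[p].values() if not e["delivered"]]
    return min(pending) if pending else 0

# One alert fires, is delivered, resolves, and its resolution is delivered.
store = {"active": {}, "resolved": {}, "completed": {}}
reconcile([{"labels": {"alertname": "HighLatency"}}], store)
deliver(store, send=lambda alert: True)   # pretend delivery succeeded
reconcile([], store)                      # alert disappeared upstream
deliver(store, send=lambda alert: True)   # resolution delivered, archived
```

[If `send` returns `False` (a receiver outage), the entry simply stays undelivered and the gauge starts reporting its age, which is the back-pressure property described above.]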
>> If anyone knows that this design is crazy I'd love to hear!
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Developers" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-developers/be2f9bfd-ba4d-46ea-9816-f19ebef499d6n%40googlegroups.com

