Thanks for the quick response! That would make sense: my group_interval is 
also 1m, so I'll try raising it to see whether that's what is limiting the 
retries. If it is, then as you say we'll probably have to implement some 
local webhook and alert-storage solution. We would be delighted to get 
all the alerts, resolved or not :) we need them to keep track of what has 
happened in the system at different points in time.
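
In case it helps anyone else, here is the rough shape of the local queuing 
webhook we're considering. This is only a sketch: the listen address, target 
URL, and retry interval are placeholders, and durable on-disk storage is 
left out.

        // queueproxy.go - a minimal local proxy that acks Alertmanager
        // webhook POSTs immediately and retries delivery to the real
        // endpoint until it succeeds.
        package main

        import (
                "bytes"
                "io"
                "log"
                "net/http"
                "sync"
                "time"
        )

        // Queue stores raw webhook payloads until they are delivered.
        type Queue struct {
                mu      sync.Mutex
                pending [][]byte
        }

        func (q *Queue) Push(p []byte) {
                q.mu.Lock()
                defer q.mu.Unlock()
                q.pending = append(q.pending, p)
        }

        // Peek returns the oldest payload without removing it, or nil if empty.
        func (q *Queue) Peek() []byte {
                q.mu.Lock()
                defer q.mu.Unlock()
                if len(q.pending) == 0 {
                        return nil
                }
                return q.pending[0]
        }

        // Pop removes the oldest payload.
        func (q *Queue) Pop() {
                q.mu.Lock()
                defer q.mu.Unlock()
                if len(q.pending) > 0 {
                        q.pending = q.pending[1:]
                }
        }

        func main() {
                const target = "http://192.168.1.10:4444" // the real webhook
                q := &Queue{}

                // Forward queued payloads in order; retry forever on failure.
                go func() {
                        for {
                                p := q.Peek()
                                if p == nil {
                                        time.Sleep(time.Second)
                                        continue
                                }
                                resp, err := http.Post(target, "application/json", bytes.NewReader(p))
                                if err != nil {
                                        time.Sleep(10 * time.Second) // target down; try again later
                                        continue
                                }
                                resp.Body.Close()
                                q.Pop()
                        }
                }()

                // Accept webhooks from Alertmanager and ack immediately.
                http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
                        body, err := io.ReadAll(r.Body)
                        if err != nil {
                                http.Error(w, err.Error(), http.StatusBadRequest)
                                return
                        }
                        q.Push(body)
                        w.WriteHeader(http.StatusOK)
                })
                log.Fatal(http.ListenAndServe("127.0.0.1:9095", nil))
        }

Pointing the Alertmanager webhook url at the proxy would mean notifications 
are acked well within group_interval, while delivery to the real endpoint 
can retry indefinitely.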

Thank you for your help!

On Tuesday, December 1, 2020 at 11:48:44 AM UTC+1 [email protected] wrote:

> In notify/notify.go I see:
>
>         for {
>                 i++
>                 // Always check the context first to not notify again.
>                 select {
>                 case <-ctx.Done():
>                         if iErr == nil {
>                                 iErr = ctx.Err()
>                         }
>
>                         return ctx, nil, errors.Wrapf(iErr, "%s/%s: notify retry canceled after %d attempts", r.groupName, r.integration.String(), i)
>
> That is: it keeps retrying at exponential intervals until the overall 
> context expires - which according to your measurements is 1 minute.
>
> I'm not entirely sure where this limit comes from, but it might be the 
> group_interval - see dispatch/dispatch.go:
>
>                         // Give the notifications time until the next flush to
>                         // finish before terminating them.
>                         ctx, cancel := context.WithTimeout(ag.ctx, ag.timeout(ag.opts.GroupInterval))
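>
> For illustration (the values here are invented; only the receiver name 
> comes from your log below), group_interval sits in the route block of 
> alertmanager.yml, so raising it extends how long each notification may 
> keep retrying before its context is canceled:
>
>         route:
>           receiver: sd_webhook
>           group_wait: 30s
>           group_interval: 10m
>           repeat_interval: 4h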
>
> I don't think it's designed to be a long-term queue.  If you have a 
> situation where the webhook endpoint really could be down for hours on end, 
> and you don't want to lose alerts, then I think you should run a local 
> webhook on the same server, which queues the requests and then delivers 
> them to the *real* webhook when it becomes available.
>
> Of course, you'd also have to be happy that you may get a splurge of 
> alerts, many of which may already have been resolved.
>
> On Tuesday, 1 December 2020 at 09:47:17 UTC [email protected] 
> wrote:
>
>> Does anyone know how Alertmanager can be configured to allow permanent 
>> notify retries? If connection to the webhook target were lost for several 
>> hours, with my current setup none of the alerts that occurred during the 
>> outage would be sent, and no one would ever know something was amiss. 
>>
>> To add more context: the retries cease after 1 min, and it does 12 
>> retries in total. Looking through the Alertmanager code, it seems that 
>> in v0.21 (the version we are running) the retries should be endless, 
>> capped at 1 min per retry (if I'm reading the backoff timer code 
>> correctly), so it seems odd that the retries end after one minute. 
>>
>> Here's a sample of the error I see in the Alertmanager logs:
>>
>>         level=error ts=2020-11-27T13:03:54.660Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="sd_webhook/webhook[0]: notify retry canceled after 12 attempts: Post \"http://192.168.1.10:4444\": dial tcp 192.168.1.10:4444: connect: connection refused"
>>
>
