In notify/notify.go I see:

        for {
                i++
                // Always check the context first to not notify again.
                select {
                case <-ctx.Done():
                        if iErr == nil {
                                iErr = ctx.Err()
                        }

                        return ctx, nil, errors.Wrapf(iErr, "%s/%s: notify retry canceled after %d attempts", r.groupName, r.integration.String(), i)

That is: it keeps retrying at exponentially increasing intervals until the 
overall context expires - which, according to your measurements, is after 
about 1 minute.

I'm not entirely sure where this limit comes from, but it might be the 
group_interval - see dispatch/dispatch.go:

                        // Give the notifications time until the next flush to
                        // finish before terminating them.
                        ctx, cancel := context.WithTimeout(ag.ctx, ag.timeout(ag.opts.GroupInterval))
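
If the limit really is the group_interval, then raising it in 
alertmanager.yml should widen the retry window correspondingly. A sketch - 
the value is purely illustrative, and note that it also delays the next 
grouped notification:

```yaml
route:
  receiver: sd_webhook   # receiver name taken from your log line
  group_interval: 10m    # illustrative; a larger value gives notify longer to retry
```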

I don't think it's designed to be a long-term queue.  If you have a 
situation where the webhook endpoint really could be down for hours on end, 
and you don't want to lose alerts, then I think you should run a local 
webhook on the same server, which queues the requests and then delivers 
them to the *real* webhook when it becomes available.

Of course, you'd also have to be happy that you may get a splurge of 
alerts, many of which may already have been resolved.

On Tuesday, 1 December 2020 at 09:47:17 UTC [email protected] 
wrote:

> Does anyone know how alertmanager can be configured to allow permanent 
> notify retries? If connection was lost to the webhook target for several 
> hours, with my current setup none of the alerts that occurred during the 
> outage would be sent, and no one would ever know something was amiss 
>
> To add more context, the retries cease after 1 min, and it does 12 retries 
> in total. I was looking through the alertmanager code and it seems that in 
> v0.21 (which is the one we are running) the retries should be endless, 
> capped at 1 min per retry (if I'm reading the backoff timer code correctly) 
> so it seems odd that the retries end after one minute 
>
> Here's a sample of the error I see in the Alertmanager logs:level=error 
> ts=2020-11-27T13:03:54.660Z caller=dispatch.go:309 component=dispatcher 
> msg="Notify for alerts failed" num_alerts=3 err="sd_webhook/webhook[0]: 
> notify retry canceled after 12 attempts: Post \"http://192.168.1.10:4444\": 
> dial tcp 192.168.1.10:4444: connect: connection refused"  
>
