In notify/notify.go I see:
    for {
        i++
        // Always check the context first to not notify again.
        select {
        case <-ctx.Done():
            if iErr == nil {
                iErr = ctx.Err()
            }
            return ctx, nil, errors.Wrapf(iErr, "%s/%s: notify retry canceled after %d attempts", r.groupName, r.integration.String(), i)
That is: it keeps retrying with exponential backoff until the overall
context expires, which according to your measurements happens after 1 minute.
I'm not entirely sure where this limit comes from, but it might be the
group_interval; see dispatch/dispatch.go:
    // Give the notifications time until the next flush to
    // finish before terminating them.
    ctx, cancel := context.WithTimeout(ag.ctx, ag.timeout(ag.opts.GroupInterval))
I don't think it's designed to be a long-term queue. If you have a
situation where the webhook endpoint really could be down for hours on end,
and you don't want to lose alerts, then I think you should run a local
webhook on the same server, which queues the requests and then delivers
them to the *real* webhook when it becomes available.
Of course, you'd also have to be happy that you may get a splurge of
alerts, many of which may already have been resolved.
On Tuesday, 1 December 2020 at 09:47:17 UTC [email protected]
wrote:
> Does anyone know how alertmanager can be configured to allow permanent
> notify retries? If connection was lost to the webhook target for several
> hours, with my current setup none of the alerts that occurred during the
> outage would be sent, and no one would ever know something was amiss
>
> To add more context, the retries cease after 1 min, and it does 12 retries
> in total. I was looking through the alertmanager code and it seems that in
> v0.21 (which is the one we are running) the retries should be endless,
> capped at 1 min per retry (if I'm reading the backoff timer code correctly)
> so it seems odd that the retries end after one minute
>
> Here's a sample of the error I see in the Alertmanager logs:
> level=error ts=2020-11-27T13:03:54.660Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="sd_webhook/webhook[0]: notify retry canceled after 12 attempts: Post \"http://192.168.1.10:4444\": dial tcp 192.168.1.10:4444: connect: connection refused"
>