Does anyone know how Alertmanager can be configured to retry notifications 
indefinitely? If the connection to the webhook target were lost for several 
hours, with my current setup none of the alerts that fired during the 
outage would be sent, and no one would ever know something was amiss.

To add more context, the retries stop after about 1 minute, after 12 
attempts in total. Looking through the Alertmanager code, it seems that in 
v0.21 (the version we are running) the retries should be endless, with the 
backoff interval capped at 1 minute per retry (if I'm reading the backoff 
timer code correctly), so it seems odd that the retries end after one minute.
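
To make my reading concrete, here is a minimal sketch in Python (not the Alertmanager source, which is Go). It assumes an exponential backoff that starts at 0.5 s, grows by 1.5x per retry and is capped at 60 s, with no jitter, and it assumes the caller imposes an overall deadline on the notify attempt. The numbers and the function name simulate_retries are my own placeholders. Time is simulated, so nothing actually sleeps.

```python
def simulate_retries(deadline_s, initial_s=0.5, factor=1.5, max_interval_s=60.0):
    """Count failed attempts made before an overall deadline expires.

    All parameters are assumptions for illustration, not values read
    from the Alertmanager configuration or source.
    """
    elapsed = 0.0          # simulated seconds since the first attempt
    interval = initial_s   # wait before the next attempt
    attempts = 0
    while True:            # the loop itself has no retry limit
        attempts += 1      # every attempt is assumed to fail
        # Stop if the next wait would run past the caller's deadline.
        if elapsed + interval > deadline_s:
            return attempts, elapsed
        elapsed += interval
        # Grow the wait, but never beyond the cap.
        interval = min(interval * factor, max_interval_s)


if __name__ == "__main__":
    attempts, elapsed = simulate_retries(deadline_s=60.0)
    print(f"gave up after {attempts} attempts, {elapsed:.1f}s elapsed")
```

If something like this deadline exists in the dispatcher, it might explain why a loop with no retry limit still gives up: in this sketch a 60 s deadline allows 11 attempts, which is close to the 12 in my log. I have not confirmed that this is what actually happens in v0.21.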

Here's a sample of the error I see in the Alertmanager logs:

level=error ts=2020-11-27T13:03:54.660Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="sd_webhook/webhook[0]: notify retry canceled after 12 attempts: Post \"http://192.168.1.10:4444\": dial tcp 192.168.1.10:4444: connect: connection refused"
