On 16/02/2020 10:09, bryan wrote:
> yes, I'm running an alertmanager cluster, and I have turned on prometheus
> "debug" level logging, but nothing could be found, for details:
Have you set --log.level=debug on the alertmanager processes as well?
I see the following in my (non-clustered) test environment:
Feb 16 11:04:41 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:04:41.923Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]
Feb 16 11:05:56 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:05:56.922Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]
Feb 16 11:06:26 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:06:26.952Z caller=dispatch.go:465 component=dispatcher
aggrGroup="{}:{alertname=\"UpDown\"}" msg=flushing
alerts=[UpDown[0f48c03][active]]
Feb 16 11:07:11 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:07:11.924Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]
This shows the alerts being received from prometheus. However I don't
see any debug logs for the SMTP exchanges when it's sending out mail.
When I resolve the problem, alertmanager logs show:
Feb 16 11:22:11 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:22:11.922Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][resolved]
Feb 16 11:23:26 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:23:26.921Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][resolved]
So I was wrong: prometheus *does* actively notify alertmanager of
resolved alerts.
While the SMTP server was down, no error was logged. But after
restarting the SMTP server, the message was delivered - so it appears
that alertmanager does its own queueing and retrying.
One thing that might be useful to you is the alertmanager metrics for
failed notifications:
$ curl -s localhost:9093/metrics | grep notifications_failed
# HELP alertmanager_notifications_failed_total The total number of
failed notifications.
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="hipchat"} 0
alertmanager_notifications_failed_total{integration="opsgenie"} 0
alertmanager_notifications_failed_total{integration="pagerduty"} 0
alertmanager_notifications_failed_total{integration="pushover"} 0
alertmanager_notifications_failed_total{integration="slack"} 0
alertmanager_notifications_failed_total{integration="victorops"} 0
alertmanager_notifications_failed_total{integration="webhook"} 0
alertmanager_notifications_failed_total{integration="wechat"} 0
You could try this on all your alertmanager nodes, and see if a
particular one has problems with E-mail.
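
A rough sketch of how that per-node check could be scripted, for what
it's worth. It only reads the same /metrics endpoint as the curl command
above. The node list, the function names and the 5 second timeout are
placeholders made up for illustration, not anything taken from your
setup, and the parsing assumes the plain text format shown above.

```python
import re
import urllib.request

# Hypothetical node list: replace with the host:port of each of your
# alertmanager instances.
NODES = ["localhost:9093"]

METRIC = "alertmanager_notifications_failed_total"


def parse_failed(metrics_text):
    """Return {integration: failure count} from text-format metrics."""
    failed = {}
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and unrelated metrics.
        if not line.startswith(METRIC + "{"):
            continue
        match = re.search(r'integration="([^"]*)"', line)
        if match is None:
            continue
        # The sample value is the first token after the closing brace.
        value = float(line[line.rindex("}") + 1:].split()[0])
        name = match.group(1)
        # Sum in case a version exposes several series per integration.
        failed[name] = failed.get(name, 0.0) + value
    return failed


def fetch_metrics(node, timeout=5):
    """Fetch the raw /metrics page from one alertmanager node."""
    url = "http://{}/metrics".format(node)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_nodes(nodes):
    """Print the integrations with failed notifications on each node."""
    for node in nodes:
        try:
            failed = parse_failed(fetch_metrics(node))
        except OSError as exc:  # covers connection errors and timeouts
            print("{}: could not fetch metrics: {}".format(node, exc))
            continue
        problems = {k: v for k, v in failed.items() if v > 0}
        print("{}: {}".format(node, problems or "no failed notifications"))
```

Calling check_nodes(NODES) prints one line per node, so a node whose
"email" counter is non-zero should stand out from the others.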