Mostafa, have you had any luck resolving this issue? I am running into the exact same problem.
Joe
On Monday, February 3, 2020 at 8:58:38 AM UTC-8, Mostafa Hajizadeh wrote:
>
> Hi,
>
> Thanks for the tips. I will follow them and let you know.
>
> BR,
> Mostafa
>
> On 14 Bahman 1398, at 20:25, Simon Pasquier <[email protected]> wrote:
> >
> > I would check the following query in the graph view to make sure that
> > the alert is constantly firing:
> >
> >     ALERTS{alertname="DiskSpaceLow"}
> >
> > You can remove resolve_timeout from your Alertmanager configuration
> > (though it's unlikely to be the issue). It shouldn't be needed if you
> > run a recent version of Prometheus.
> > Other than that, try running Alertmanager with the "--log.level=debug" flag.
> >
> > On Sat, Feb 1, 2020 at 9:07 AM Mostafa Hajizadeh <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> So sorry for the late response. I did not see your email.
> >>
> >> Both are 30s.
> >>
> >> This is from Prometheus’s configuration page in its web panel:
> >>
> >>     global:
> >>       scrape_interval: 30s
> >>       scrape_timeout: 10s
> >>       evaluation_interval: 30s
> >>
> >> We have set scrape_interval through our ServiceMonitor for each service, but that is 30s too.
> >>
> >> BR,
> >> Mostafa
> >>
> >> On 3 Bahman 1398, at 20:19, Simon Pasquier <[email protected]> wrote:
> >>
> >> What's your evaluation_interval and scrape_interval in Prometheus?
> >>
> >> On Sun, Jan 19, 2020 at 10:16 AM Mostafa Hajizadeh <[email protected]> wrote:
> >>
> >> Sorry for so many typos in the last paragraphs. :-)
> >>
> >> On Sunday, January 19, 2020 at 12:41:55 PM UTC+3:30, Mostafa Hajizadeh wrote:
> >>
> >> Hi,
> >>
> >> I’ve been struggling with this for days and still have not found the root of the problem or a solution.
> >>
> >> We have configured Prometheus to send alerts to Alertmanager based on Node Exporter data.
> >> Here are the rules defined in Prometheus:
> >>
> >>     - alert: DiskSpaceLow
> >>       annotations:
> >>         description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
> >>         summary: Remaining disk space is low
> >>       expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 15
> >>       for: 15m
> >>       labels:
> >>         severity: warning
> >>     - alert: DiskSpaceLow
> >>       annotations:
> >>         description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
> >>         summary: Remaining disk space is low
> >>       expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 2
> >>       labels:
> >>         severity: critical
> >>
> >> Here is a summarized version of our Alertmanager configuration:
> >>
> >>     global:
> >>       resolve_timeout: 1m
> >>
> >>     route:
> >>       group_by: ['alertname', 'severity']
> >>       group_wait: 1m
> >>       group_interval: 5m
> >>       repeat_interval: 1d
> >>       routes:
> >>       - match:
> >>           endpoint: metrics
> >>         group_wait: 10s
> >>         group_interval: 1m
> >>         repeat_interval: 6h
> >>         continue: true
> >>         routes:
> >>         - match:
> >>             team: xxx
> >>           receiver: xxx-team-receiver
> >>         …
> >>
> >>     receivers:
> >>       …
> >>
> >>     inhibit_rules:
> >>     - target_match:
> >>         severity: warning
> >>       source_match:
> >>         severity: critical
> >>       equal:
> >>       - alertname
> >>
> >> This is an example of what happens in our Slack channel for receiving alerts:
> >>
> >>     17:27:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:32:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     17:33:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:37:05 — DiskSpaceLow resolved
> >>     17:37:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:42:35 — DiskSpaceLow resolved
> >>     17:43:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:47:35 — DiskSpaceLow resolved
> >>     17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:53:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     17:54:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:58:05 — DiskSpaceLow resolved
> >>     17:59:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     18:04:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     18:05:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>
> >> It keeps flapping like this forever. Needless to say, the disk space on these servers is not changing during this time, so the alerts should not be resolved.
> >>
> >> During these “resolved” time ranges, I checked Prometheus’s web interface and these alerts are still firing there. They never resolve there. But they disappear from Alertmanager’s web interface during these “resolved” times.
> >>
> >> I wrote a script to get the list of active alerts from Alertmanager’s API every 30 seconds to see what appears there. Here’s a weird thing that I saw: at, say, 17:36:30, the alerts are present in the API with their “endsAt” set to 17:39:04 (three minutes after their updatedAt), but at 17:37:00 there are no alerts at all. The API does not return any of the previous alerts, even though their endsAt has not yet passed.
> >>
> >> Why does Alertmanager suddenly resolve/remove these alerts before their endsAt arrives?
> >>
> >> Any help is appreciated because I have been struggling with this problem for days. I even read the source code of Prometheus and Alertmanager but could not find anything there that could cause this problem.
> >> BR,
> >> Mostafa

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/bfaf6bbf-928d-4105-b38e-0a879b11b400%40googlegroups.com.
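[Editor's note] Simon's suggestion of watching the ALERTS series on the Prometheus side can also be scripted against the Prometheus HTTP API, which makes it easy to confirm the alert never leaves the "firing" state there while Alertmanager shows it resolved. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (the base URL is illustrative):

```python
import json
import urllib.parse
import urllib.request

def alerts_query_url(base_url, alertname):
    """Build a Prometheus HTTP API instant-query URL for the ALERTS
    time series of the given alert name."""
    query = 'ALERTS{alertname="%s"}' % alertname
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": query})

def firing_series(base_url, alertname):
    """Fetch the query result and return the label sets of series that
    are currently in the 'firing' state."""
    with urllib.request.urlopen(alerts_query_url(base_url, alertname)) as resp:
        data = json.load(resp)
    return [
        r["metric"] for r in data["data"]["result"]
        if r["metric"].get("alertstate") == "firing"
    ]
```

Running `firing_series("http://localhost:9090", "DiskSpaceLow")` every evaluation interval and comparing against the Alertmanager API output would show whether the gap is on the Prometheus side or the Alertmanager side.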


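[Editor's note] The 30-second polling script Mostafa describes might look roughly like the sketch below. This is not his actual script; the Alertmanager base URL, the v2 `/api/v2/alerts` endpoint, and the RFC 3339 timestamp handling are assumptions to adjust for your Alertmanager version. The useful check is the one he performed: flag alerts whose `endsAt` lies in the future, then watch whether they vanish from the API before that time arrives.

```python
import datetime
import json
import time
import urllib.request

def parse_rfc3339(ts):
    """Parse timestamps like '2020-02-03T17:39:04.000Z' into aware datetimes."""
    return datetime.datetime.fromisoformat(ts.replace("Z", "+00:00"))

def still_active(alerts, now):
    """Return the alerts whose endsAt is still in the future, i.e. alerts
    Alertmanager should still consider active at time `now`."""
    return [a for a in alerts if parse_rfc3339(a["endsAt"]) > now]

def poll(url="http://localhost:9093/api/v2/alerts", interval=30):
    """Poll the Alertmanager API and log how many alerts it reports,
    plus the endsAt of each alert that should still be active."""
    while True:
        with urllib.request.urlopen(url) as resp:
            alerts = json.load(resp)
        now = datetime.datetime.now(datetime.timezone.utc)
        print("%s: %d alerts" % (now.isoformat(), len(alerts)))
        for a in still_active(alerts, now):
            print("  %s endsAt=%s" % (a["labels"].get("alertname"), a["endsAt"]))
        time.sleep(interval)
```

If an alert listed with a future `endsAt` on one iteration is absent on the next, that reproduces the behaviour described in the thread.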