Mostafa, have you had any luck resolving this issue? I am running into the exact same problem.
Joe
On Monday, February 3, 2020 at 8:58:38 AM UTC-8, Mostafa Hajizadeh wrote:
>
> Hi,
>
> Thanks for the tips. I will follow them and let you know.
>
> BR,
> Mostafa
>
> On 14 Bahman 1398, at 20:25, Simon Pasquier <[email protected]> wrote:
> >
> > I would check the following query in the graph view to make sure that
> > the alert is constantly firing:
> >
> >     ALERTS{alertname="DiskSpaceLow"}
> >
> > You can remove resolve_timeout from your Alertmanager configuration
> > (though it's unlikely to be the issue). It shouldn't be needed if you
> > run a recent version of Prometheus.
> > Other than that, try running Alertmanager with the "--log.level=debug" flag.
> >
> > On Sat, Feb 1, 2020 at 9:07 AM Mostafa Hajizadeh <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> So sorry for the late response. I did not see your email.
> >>
> >> Both are 30s.
> >>
> >> This is from Prometheus’s configuration page in its web panel:
> >>
> >>     global:
> >>       scrape_interval: 30s
> >>       scrape_timeout: 10s
> >>       evaluation_interval: 30s
> >>
> >> We have set scrape_interval through our ServiceMonitor for each service, but that is 30s too.
> >>
> >> BR,
> >> Mostafa
> >>
> >> On 3 Bahman 1398, at 20:19, Simon Pasquier <[email protected]> wrote:
> >>
> >> What's your evaluation_interval and scrape_interval in Prometheus?
> >>
> >> On Sun, Jan 19, 2020 at 10:16 AM Mostafa Hajizadeh <[email protected]> wrote:
> >>
> >> Sorry for so many typos in the last paragraphs. :-)
> >>
> >> On Sunday, January 19, 2020 at 12:41:55 PM UTC+3:30, Mostafa Hajizadeh wrote:
> >>
> >> Hi,
> >>
> >> I’ve been struggling with this for days and still have not found the root of the problem or a solution.
> >>
> >> We have configured Prometheus to send alerts to Alertmanager based on Node Exporter data.
> >> Here are the rules defined in Prometheus:
> >>
> >>     - alert: DiskSpaceLow
> >>       annotations:
> >>         description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
> >>         summary: Remaining disk space is low
> >>       expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 15
> >>       for: 15m
> >>       labels:
> >>         severity: warning
> >>     - alert: DiskSpaceLow
> >>       annotations:
> >>         description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
> >>         summary: Remaining disk space is low
> >>       expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 2
> >>       labels:
> >>         severity: critical
> >>
> >> Here is a summarized version of our Alertmanager configuration:
> >>
> >>     global:
> >>       resolve_timeout: 1m
> >>
> >>     route:
> >>       group_by: ['alertname', 'severity']
> >>       group_wait: 1m
> >>       group_interval: 5m
> >>       repeat_interval: 1d
> >>       routes:
> >>       - match:
> >>           endpoint: metrics
> >>         group_wait: 10s
> >>         group_interval: 1m
> >>         repeat_interval: 6h
> >>         continue: true
> >>         routes:
> >>         - match:
> >>             team: xxx
> >>           receiver: xxx-team-receiver
> >>         …
> >>
> >>     receivers:
> >>       …
> >>
> >>     inhibit_rules:
> >>     - target_match:
> >>         severity: warning
> >>       source_match:
> >>         severity: critical
> >>       equal:
> >>       - alertname
> >>
> >> This is an example of what happens in our Slack channel for receiving alerts:
> >>
> >>     17:27:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:32:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     17:33:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:37:05 — DiskSpaceLow resolved
> >>     17:37:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:42:35 — DiskSpaceLow resolved
> >>     17:43:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:47:35 — DiskSpaceLow resolved
> >>     17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:53:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     17:54:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     17:58:05 — DiskSpaceLow resolved
> >>     17:59:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>     18:04:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
> >>     18:05:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
> >>
> >> It keeps flapping like this forever. Needless to say, the disk space on these servers is not changing during this time, so the alerts should not be resolved.
> >>
> >> During these “resolved” time ranges, I checked Prometheus’s web interface and these alerts are still firing there. They never resolve there. But they disappear from Alertmanager’s web interface during these “resolved” times.
> >>
> >> I wrote a script to get the list of active alerts from Alertmanager’s API every 30 seconds to see what appears there. Here’s a weird thing that I saw: at, say, 17:36:30, the alerts are present in the API with their “endsAt” set to 17:39:04 (three minutes after their updatedAt), but at 17:37:00 there are no alerts at all. The API does not return any of the previous alerts, even though their endsAt has not yet passed.
> >>
> >> Why does Alertmanager suddenly resolve/remove these alerts before their endsAt arrives?
> >>
> >> Any help is appreciated because I have been struggling with this problem for days. I even read the source code of Prometheus and Alertmanager but could not find anything there that could cause this problem.
> >> BR,
> >> Mostafa

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/bfaf6bbf-928d-4105-b38e-0a879b11b400%40googlegroups.com.
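[Editor's note] Simon's suggestion of watching the ALERTS series on the Prometheus side can also be scripted against the Prometheus HTTP API, which makes it easy to confirm the alert never leaves the "firing" state there while Alertmanager shows it resolved. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (the base URL is illustrative):

```python
import json
import urllib.parse
import urllib.request

def alerts_query_url(base_url, alertname):
    """Build a Prometheus HTTP API instant-query URL for the ALERTS
    time series of the given alert name."""
    query = 'ALERTS{alertname="%s"}' % alertname
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": query})

def firing_series(base_url, alertname):
    """Fetch the query result and return the label sets of series that
    are currently in the 'firing' state."""
    with urllib.request.urlopen(alerts_query_url(base_url, alertname)) as resp:
        data = json.load(resp)
    return [
        r["metric"] for r in data["data"]["result"]
        if r["metric"].get("alertstate") == "firing"
    ]
```

Running `firing_series("http://localhost:9090", "DiskSpaceLow")` every evaluation interval and comparing against the Alertmanager API output would show whether the gap is on the Prometheus side or the Alertmanager side.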


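[Editor's note] The 30-second polling script Mostafa describes might look roughly like the sketch below. This is not his actual script; the Alertmanager base URL, the v2 `/api/v2/alerts` endpoint, and the RFC 3339 timestamp handling are assumptions to adjust for your Alertmanager version. The useful check is the one he performed: flag alerts whose `endsAt` lies in the future, then watch whether they vanish from the API before that time arrives.

```python
import datetime
import json
import time
import urllib.request

def parse_rfc3339(ts):
    """Parse timestamps like '2020-02-03T17:39:04.000Z' into aware datetimes."""
    return datetime.datetime.fromisoformat(ts.replace("Z", "+00:00"))

def still_active(alerts, now):
    """Return the alerts whose endsAt is still in the future, i.e. alerts
    Alertmanager should still consider active at time `now`."""
    return [a for a in alerts if parse_rfc3339(a["endsAt"]) > now]

def poll(url="http://localhost:9093/api/v2/alerts", interval=30):
    """Poll the Alertmanager API and log how many alerts it reports,
    plus the endsAt of each alert that should still be active."""
    while True:
        with urllib.request.urlopen(url) as resp:
            alerts = json.load(resp)
        now = datetime.datetime.now(datetime.timezone.utc)
        print("%s: %d alerts" % (now.isoformat(), len(alerts)))
        for a in still_active(alerts, now):
            print("  %s endsAt=%s" % (a["labels"].get("alertname"), a["endsAt"]))
        time.sleep(interval)
```

If an alert listed with a future `endsAt` on one iteration is absent on the next, that reproduces the behaviour described in the thread.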