Hi Brian,
Alerting rule:

- name: disk utilization increasing
  rules:
  - alert: disk_utilization
    annotations:
      summary: Disk utilization increased x%
    expr: 100 - (node_filesystem_avail_bytes{name="x"} / node_filesystem_size_bytes{name="x"} * 100) > x
    for: 5m
    labels:
      severity: critical
Alertmanager route config:

receivers:
- name: x
route:
  group_by:
  - alertname
  group_interval: 30s
  group_wait: 30s
  receiver: x
  repeat_interval: 8h
Is there any way we can avoid the alert resolving due to the expr value
disappearing? (The values disappear because the node exporter running on the
host goes down at some point.)
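One possible sketch of a workaround (my own assumption, not something we have
tried yet; last_over_time requires Prometheus 2.26+): carry the last seen
sample forward over a short window, so a brief exporter outage does not empty
the expression and resolve the alert:

    expr: 100 - (last_over_time(node_filesystem_avail_bytes{name="x"}[10m])
          / last_over_time(node_filesystem_size_bytes{name="x"}[10m]) * 100) > x

The 10m window here is arbitrary; it just needs to comfortably cover the
longest expected exporter gap.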
You mean that if we are fetching metrics from an exporter, the exporter
should always be up and running while we are monitoring it?
I also want to understand the points below.

Questions:
1) How is the endsAt time calculated? Is it calculated from resend_delay?
2) What is the default value of resend_delay? How can I check this
configuration, and in which file is it defined?
3) Are "msg: Received alert" log lines written when Prometheus sends alerts
to Alertmanager? And when are "msg: flushing" lines logged? (See logs below.)
4) With evaluation_interval: 1m and scrape_interval: 1m, why is there a 2m
difference between the alert received at 12:34 and the one received at 12:36?
Also, when I do a GET request for alerts from Alertmanager, the endsAt
time is +4 minutes from the last received alert. Why is that? Is my
resend_delay 4m? I didn't set the resend_delay value.
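For reference (my assumption from the Prometheus docs, please correct me if
I'm wrong): resend_delay does not seem to be set in any YAML file; it is the
Prometheus server command-line flag --rules.alert.resend-delay, which
defaults to 1m:

    ./prometheus --rules.alert.resend-delay=1m   # default; minimum wait before resending an alert to Alertmanager

If endsAt is stamped a few multiples of this delay into the future, 4x the 1m
default would match the +4 minutes I'm seeing, even though I never set the
flag. That's a guess, though.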
On Mon, Aug 30, 2021 at 5:02 PM Brian Candler <[email protected]> wrote:
> Are you sure that's your problem? Can you show your complete alerting
> rule and its enclosing rule group?
>
> When starting an alert, the expression has to return a value for a certain
> amount of time ("for:") before the alert triggers. But the converse is not
> true: if the expr value disappears, even for a single evaluation cycle, the
> alert is immediately resolved.
>
> Therefore, try entering your alert expr in the PromQL browser, and look
> for any gaps in it. Any gap will resolve the alert.
>
> On Sunday, 29 August 2021 at 13:53:47 UTC+1 [email protected] wrote:
>
>> Hi,
>>
>> Recently, I've been debugging an issue where an alert resolves even
>> though Prometheus shows it as firing.
>> So the cycle is firing -> resolving -> firing.
>>
>> After going through some documents and blogs, I found out that
>> Alertmanager will resolve an alert if Prometheus doesn't resend it
>> within the "resolve_timeout".
>> However, Prometheus sends the endsAt field to Alertmanager with a
>> very short timeout, after which Alertmanager can mark the alert as
>> resolved. This overrides the resolve_timeout setting in Alertmanager
>> and creates the firing->resolved->firing behavior if Prometheus does
>> not resend the alert before that short timeout.
>>
>> Is that understanding correct?
>>
>> Questions:
>> 1) How is the endsAt time calculated? Is it calculated from resend_delay?
>> 2) What is the default value of resend_delay? How can I check this
>> configuration, and in which file is it defined?
>>
>> 3) Are "msg: Received alert" log lines written when Prometheus sends
>> alerts to Alertmanager? And when are "msg: flushing" lines logged? (below)
>>
>> 4) With evaluation_interval: 1m and scrape_interval: 1m, why is there a
>> 2m difference between the alert received at 12:34 and the one received
>> at 12:36?
>> Also, when I do a GET request for alerts from Alertmanager, the endsAt
>> time is +4 minutes from the last received alert. Why is that? Is my
>> resend_delay 4m? I didn't set the resend_delay value.
>>
>> Below are the logs from Alertmanager:
>>
>> level=debug ts=2021-08-29T12:34:40.342Z caller=dispatch.go:138
>> component=dispatcher msg="Received alert"
>> alert=disk_utilization[6356c43][active]
>> level=debug ts=2021-08-29T12:34:40.342Z caller=dispatch.go:138
>> component=dispatcher msg="Received alert"
>> alert=disk_utilization[1db5352][active]
>>
>> level=debug ts=2021-08-29T12:34:40.381Z caller=dispatch.go:473
>> component=dispatcher
>> aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}"
>> msg=flushing alerts="[disk_utilization[6356c43][active]
>> disk_utilization[1db5352][active]]"
>> level=debug ts=2021-08-29T12:35:10.381Z caller=dispatch.go:473
>> component=dispatcher
>> aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}"
>> msg=flushing alerts="[disk_utilization[6356c43][active]
>> disk_utilization[1db5352][active]]"
>> level=debug ts=2021-08-29T12:35:40.382Z caller=dispatch.go:473
>> component=dispatcher
>> aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}"
>> msg=flushing alerts="[disk_utilization[6356c43][active]
>> disk_utilization[1db5352][active]]"
>> level=debug ts=2021-08-29T12:36:10.382Z caller=dispatch.go:473
>> component=dispatcher
>> aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}"
>> msg=flushing alerts="[disk_utilization[6356c43][active]
>> disk_utilization[1db5352][active]]"
>>
>> level=debug ts=2021-08-29T12:36:40.345Z caller=dispatch.go:138
>> component=dispatcher msg="Received alert"
>> alert=disk_utilization[6356c43][active]
>> level=debug ts=2021-08-29T12:36:40.345Z caller=dispatch.go:138
>> component=dispatcher msg="Received alert"
>> alert=disk_utilization[1db5352][active]
>>
>> GET request to Alertmanager:
>> curl http://10.233.49.116:9092/api/v1/alerts
>> {"status":"success","data":[{"labels":{"alertname":"disk_utilization","device":"xx.xx.xx.xx:/media/test","fstype":"nfs4","instance":"xx.xx.xx.xx","job":"test-1","mountpoint":"/media/test","node_name":"test-1","severity":"critical"},"annotations":{"summary":"Disk
>> utilization has crossed x%. Current Disk utilization =
>> 86.823044624783"},"startsAt":"2021-08-29T11:28:40.339802555Z","endsAt":"2021-08-29T12:40:40.339802555Z","generatorURL":"x","status":{"state":"active","silencedBy":[],"inhibitedBy":[]},"receivers":["test-1"],"fingerprint":"1db535212ea6dcf6"},{"labels":{"alertname":"disk_utilization","device":"test","fstype":"ext4","instance":"xx.xx.xx.xx","job":"Node_test-1","mountpoint":"/","node_name":"test-1","severity":"critical"},"annotations":{"summary":"Disk
>> utilization has crossed x%. Current Disk utilization =
>> 94.59612027578963"},"startsAt":"2021-08-29T11:28:40.339802555Z","endsAt":"2021-08-29T12:40:40.339802555Z","generatorURL":"x","status":{"state":"active","silencedBy":[],"inhibitedBy":[]},"receivers":["test-1"],"fingerprint":"6356c43dc3589622"}]}
>>
>>
>>
>> thanks,
>> Akshay
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/0af8717a-791a-49a1-9efc-f256273854b3n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/0af8717a-791a-49a1-9efc-f256273854b3n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>