On Tue, Jun 20, 2023 at 1:15 PM Lena <[email protected]> wrote:

> Hello Julius,
>
> Thank you very much for your reply.
> When I previously read about the lookback delta, I assumed that the graph
> would also show missing results, i.e. any missing datapoint would appear
> as a gap in the graph. Since the graph showed an uninterrupted line, I
> didn't consider checking this again now. I see that I could be wrong.
>

That is possible, since whether a gap becomes visible depends on the exact
alignment of the rule's evaluation timestamp (or the evaluation steps in
the graph) relative to the latest sample before it.
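
One way to check this more directly than eyeballing the graph is to query
how old the latest sample of each series is at the evaluation instant, for
example (a minimal PromQL sketch using your renamed metric):

  # Seconds since the most recent sample of each series at query time.
  # Values that approach or exceed 300 mean the series falls out of the
  # 5-minute lookback window at that instant.
  time() - timestamp(database_disk_usage_bytes)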


> I also hadn't considered using the "min_over_time" expression before, but
> it looks useful.
>
> I will definitely try the suggested changes.
> Thank you again.
> Have a great day.
> On Tuesday, June 20, 2023 at 1:28:54 PM UTC+3 Julius Volz wrote:
>
>> Hi Lena,
>>
>> One thing I see is that your scrape interval is very long: 300s, which is
>> exactly 5 minutes. The lookback delta of an instant vector selector is also
>> exactly 5 minutes (see https://www.youtube.com/watch?v=xIAEEQwUBXQ&t=272s),
>> which means that the selector will stop returning a result whenever there
>> is no datapoint within the 5 minutes before the current rule evaluation
>> timestamp. That resets the "for" duration again. With a 5-minute scrape
>> interval, that can indeed happen to you at times (either just a bit of a
>> delay in scraping or in ingesting scraped samples, or even an occasional
>> failed scrape). I'd recommend setting the interval short enough that you
>> can tolerate an occasional failed scrape (like 2m). Does the problem go
>> away with a shorter interval?
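>>
>> As a sketch against your ServiceMonitor (relabeling sections omitted,
>> they would stay unchanged; note that scrapeTimeout must not exceed the
>> interval, so 120s still fits):
>>
>>   - interval: 120s
>>     path: /metrics
>>     port: disk-exporter
>>     scrapeTimeout: 120s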
>>
>> By the way: 24h is quite a long "for" duration. If the series ever
>> disappears from the selector's output at some point during those 24h
>> (for example because the exporter is down for a couple of minutes), your
>> alert will always reset again. An alternative could be to alert on an
>> expression like
>> "min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024"
>> with a much shorter "for" duration. Keeping some "for" duration is still
>> a good idea though, to cover the case of a fresh Prometheus server that
>> doesn't have 24h of data yet. That way, the alert becomes much less
>> reliant on perfect scrape / exporter behavior over a full 24h window.
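>>
>> As a sketch, that rule could look like this (the 10m "for" duration here
>> is just an illustrative choice):
>>
>>     - alert: MySQLDatabaseSize
>>       expr: min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024
>>       for: 10m
>>       labels:
>>         severity: warning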
>>
>> Regards,
>> Julius
>>
>> On Tue, Jun 20, 2023 at 10:24 AM Lena <[email protected]> wrote:
>>
>>> Hello,
>>> I hope you can help me with an issue I am facing.
>>> I use disk_usage_exporter
>>> <https://github.com/dundee/disk_usage_exporter/> to get metrics about
>>> database sizes. The metrics are scraped by Prometheus every 5 minutes.
>>> The ServiceMonitor configuration is:
>>>   - interval: 300s
>>>     metricRelabelings:
>>>     - action: replace
>>>       regex: node_disk_usage_bytes
>>>       replacement: database_disk_usage_bytes
>>>       sourceLabels:
>>>       - __name__
>>>       targetLabel: __name__
>>>     path: /metrics
>>>     port: disk-exporter
>>>     relabelings:
>>>     - action: replace
>>>       regex: (.+)-mysql-slave
>>>       replacement: $1
>>>       sourceLabels:
>>>       - service
>>>       targetLabel: cluster
>>>     scrapeTimeout: 120s
>>> Then I have an alert to notify me when a database has had a size of
>>> more than 15 GB for 24 hours:
>>>     - alert: MySQLDatabaseSize
>>>       expr: database_disk_usage_bytes > 15 * 1024 * 1024 * 1024
>>>       for: 24h
>>>       labels:
>>>         severity: warning
>>>       annotations:
>>>         dashboard: database-disk-usage?var-cluster={{ $labels.cluster }}
>>>         description: MySQL database `{{ $labels.path | reReplaceAll
>>> "/var/lib/mysql/" "" }}` takes `{{ $value | humanize }}` of disk space
>>> on pod `{{ $labels.pod }}`
>>>         summary: MySQL database has grown too big.
>>>
>>> On the testing environment the alert fires properly. However, on the
>>> production environment it never fires: it stays stuck in the Pending
>>> state, as the `Active Since` time is updated every ~5 min.
>>> The only difference between the environments is the number of databases
>>> in the cluster.
>>> Below are screenshots of the `Active Since` time; you can see that the
>>> time changes:
>>> [image: active_since1.png][image: active_since2.png]
>>> The metric labels are not changing. The graph is stable, so there are
>>> no missing metrics or gaps where the database size is undefined.
>>> [image: graph.png]
>>>
>>> A scrape takes ~20-40 s, which is still within the scrapeTimeout of
>>> 120 s.
>>>
>>> Rule evaluation takes 1-2 s, with evaluation_interval: 30s.
>>>
>>> The Prometheus version is 2.22.1.
>>>
>>> I see no related errors in the Prometheus logs and have no clue what
>>> the cause of the issue could be.
>>>
>>> Thank you for any advice.
>>>
>>
>>
>> --
>> Julius Volz
>> PromLabs - promlabs.com
>>


-- 
Julius Volz
PromLabs - promlabs.com
