Hello Julius,

Thank you very much for your reply.
When I previously read about the lookback delta, I assumed that the graph 
would also reveal missing results, i.e. that a missing datapoint would show 
up as a gap in the graph. Since the graph showed an uninterrupted line, I did 
not think to check for gaps again now. I see that I may have been wrong. 
I also had not considered using a "min_over_time" expression before, but it 
looks useful.
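
To double-check for gaps that the graph's lookback may be hiding, I plan to 
also query the raw sample counts directly, e.g. something like this in the 
Prometheus expression browser (the 10m window is just an example value, 
chosen for my 300s scrape interval):

  count_over_time(database_disk_usage_bytes[10m])

With a 5-minute scrape interval this should stay at roughly 2; dips to 1 or 
an empty result would point to missed or delayed samples.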

I will definitely try the suggested changes.
Thank you again.
Have a great day.
On Tuesday, June 20, 2023 at 1:28:54 PM UTC+3 Julius Volz wrote:

> Hi Lena,
>
> One thing I see is that your scrape interval is very long: 300s, which is 
> exactly 5 minutes. The lookback delta of an instant vector selector is also 
> exactly 5 minutes (see https://www.youtube.com/watch?v=xIAEEQwUBXQ&t=272s), 
> which means that the selector will stop returning a result whenever there is 
> no datapoint within the 5 minutes prior to the current rule evaluation 
> timestamp. That would reset the "for" duration again. With 
> a 5-minute scrape interval, that can indeed happen to you at times (either 
> just a bit of a delay in scraping or in ingesting scraped samples, or even 
> an occasional failed scrape). I'd recommend setting the interval short 
> enough that you can tolerate an occasional failed scrape (like 2m). Does 
> the problem go away with a shorter interval?
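>
> A sketch of what that could look like in the ServiceMonitor (only the timing 
> fields are shown; the 90s timeout is just an example value, and scrapeTimeout 
> should not exceed the interval):
>
>   - interval: 2m
>     scrapeTimeout: 90s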
>
> By the way: 24h is quite a long "for" duration. If the series is ever 
> absent for an even longer period during those 24h (like if the exporter is 
> down for a couple of minutes), your alerts will always reset again. An 
> alternative could be to alert on an expression like 
> "min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024" 
> with a much shorter "for" duration. But some "for" duration is still a good 
> idea, in the case of a fresh Prometheus server that doesn't have 24h of 
> data yet. That way, the alert would become less reliant on perfect scrape / 
> exporter behavior over a full 24h window.
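>
> As a sketch, such a rule could look roughly like this (the 10m "for" duration 
> is only an example value; labels and annotations stay as in your existing 
> rule):
>
>     - alert: MySQLDatabaseSize
>       expr: min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024
>       for: 10m
>       labels:
>         severity: warning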
>
> Regards,
> Julius
>
> On Tue, Jun 20, 2023 at 10:24 AM Lena <[email protected]> wrote:
>
>> Hello,
>> I hope you can help me with an issue I am facing.
>> I use disk_usage_exporter 
>> <https://github.com/dundee/disk_usage_exporter/> to get metrics about 
>> database sizes. The metrics are scraped by Prometheus every 5 minutes. The 
>> ServiceMonitor configuration is:
>>   - interval: 300s
>>     metricRelabelings:
>>     - action: replace
>>       regex: node_disk_usage_bytes
>>       replacement: database_disk_usage_bytes
>>       sourceLabels:
>>       - __name__
>>       targetLabel: __name__
>>     path: /metrics
>>     port: disk-exporter
>>     relabelings:
>>     - action: replace
>>       regex: (.+)-mysql-slave
>>       replacement: $1
>>       sourceLabels:
>>       - service
>>       targetLabel: cluster
>>     scrapeTimeout: 120s
>> Then I have an alert that should notify me if a database has been larger 
>> than 15 GB for 24 hours:
>>     - alert: MySQLDatabaseSize
>>       expr: database_disk_usage_bytes > 15 * 1024 * 1024 * 1024
>>       for: 24h
>>       labels:
>>         severity: warning
>>       annotations:
>>         dashboard: database-disk-usage?var-cluster={{ $labels.cluster }}
>>         description: MySQL database `{{ $labels.path | reReplaceAll "/var/lib/mysql/" "" }}` takes `{{ $value | humanize }}` of disk space on pod `{{ $labels.pod }}`
>>         summary: MySQL database has grown too big.
>>
>> In the testing environment the alert fires properly. However, in the 
>> production environment it never fires and stays stuck in the Pending state, 
>> because the `Active Since` time is updated every ~5 minutes. 
>> The only difference between the environments is the number of databases in 
>> the cluster. 
>> Below are screenshots of the `Active Since` time, showing that it changes: 
>> [image: active_since1.png][image: active_since2.png]
>> The metric labels are not changing, and the graph is stable, so there are no 
>> missing datapoints or gaps where the database size is undefined. 
>> [image: graph.png]
>>
>> A scrape takes ~20-40 seconds, which is still within the scrapeTimeout of 
>> 120 seconds.
>>
>> Rule evaluation takes 1-2 seconds, with an evaluation_interval of 30 seconds.
>>
>> The Prometheus version is 2.22.1.
>>
>> I see no related errors in the Prometheus logs and have no clue what could 
>> be causing the issue.
>>
>> Thank you for any advice.
>>
>
>
> -- 
> Julius Volz
> PromLabs - promlabs.com
>
