Thanks for the reply. 

1. When I keep evaluation_interval: 5m and for: 30s, I get alerts every 5
minutes. Those alerts get stored in Prometheus and fire every 5 minutes;
even when the condition was no longer matching, I was still getting alerts
every 5 minutes.


Now I am changing the config to the following:

evaluation_interval: 15s  # on the rule group, or globally

for: 5m   # on the individual alerting rule(s)

I will update you on the results soon.
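
For reference, here is a minimal sketch of where I understand these settings would go (not tested yet; the per-group "interval" field is an optional override of the global evaluation_interval):

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s   # rule groups are evaluated every 15s

# alerts_rules.yml
groups:
- name: rabbitmq_alerts
  # interval: 15s            # optional per-group override of evaluation_interval
  rules:
  - alert: "Total Messages > 10k"
    expr: rabbitmq_queue_messages > 10000
    for: 5m                  # must stay true for 20 consecutive 15s evaluations before firing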


2. "If you want a more readable string in the annotation, you can use {{
$value | humanize }}, but it will lose some precision."

This is a serious concern for us. How can we solve this?
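
One idea (just a sketch; printf and humanize are standard template functions available in Prometheus annotations, but I have not verified the exact output against our rules) is to keep the exact count in one annotation and the rounded, human-readable form in another:

annotations:
  # exact count, rendered as a plain integer (no scientific notation, no rounding)
  summary: '{{ $labels.queue }} has {{ printf "%.0f" $value }} messages for more than 1 min.'
  # rounded but human-readable, e.g. roughly "1.111M"
  description: 'Queue {{ $labels.queue }} in RabbitMQ has about {{ $value | humanize }} messages.'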

On Wednesday, March 5, 2025 at 11:43:02 PM UTC+5:30 Brian Candler wrote:

> I notice that your "up == 0" graph shows lots of green which are values 
> where up == 0. These are legitimately generating alerts, in my opinion. If 
> you have set evaluation_interval to 5m, and "for:" to be less than 5m, then 
> a single instance of up == 0 will send an alert, because that's what you 
> asked for.
>
> > I want alerts to trigger after 5 min and only if the condition is true.
>
> Then you want:
>
> evaluation_interval: 15s  # on the rule group, or globally
> for: 5m   # on the individual alerting rule(s)
>
> Then an alert will only be sent if the alert condition has been present 
> consecutively for the whole 5 minutes (i.e. 20 cycles).
>
> Finally: you may find it helpful to include {{ $value }} in an annotation 
> on each alerting rule, so you can tell which value triggered the alert. 
> I can see you've done this already in one of your alerts:
>
>    - alert: "Total Messages > 10k in last 1 min"
>       expr: rabbitmq_queue_messages > 10000
> ...
>
>       annotations:
>         summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>
> And this is reflected in the alert:
>
>       description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
>         'for more than 1 minutes.\n',
>
>       summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>
> rabbitmq_queue_messages is a vector containing zero or more instances of 
> that metric.
>
> rabbitmq_queue_messages > 10000 is a reduced vector, containing only those 
> instances of the metric with a value greater than 10000.
>
> You can see that the $value at the time the alert was generated 
> was 1.110738e+06, which is 1,110,738, and that's clearly a lot more than 
> 10,000. Hence you get an alert. It's what you asked for.
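>
> For illustration (these series values are hypothetical), the comparison acts as a filter, keeping only the elements above the threshold:
>
>     rabbitmq_queue_messages
>       {queue="QUEUE_NAME", vhost="webhook"}   1110738
>       {queue="small_queue", vhost="webhook"}  42
>
>     rabbitmq_queue_messages > 10000
>       {queue="QUEUE_NAME", vhost="webhook"}   1110738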
>
> If you want a more readable string in the annotation, you can use {{ 
> $value | humanize }}, but it will lose some precision.
>
> On Wednesday, 5 March 2025 at 10:28:15 UTC Amol Nagotkar wrote:
>
>> As you can see in the images below, the last trigger was at 15:31:29,
>> and I received emails after that time as well, for example at 15:35,
>> 15:37, etc.
>> [image: IMG-20250305-WA0061.jpg]
>>
>> [image: IMG-20250305-WA0060.jpg]
>> On Wednesday, March 5, 2025 at 3:28:20 PM UTC+5:30 Amol Nagotkar wrote:
>>
>>>
>>> Thank you for the quick reply.
>>>
>>> So, as I told you, I am not using Alertmanager. I am getting alerts based
>>> on the following config:
>>>
>>> alerting:
>>>   alertmanagers:
>>>     - static_configs:
>>>         - targets:
>>>           - IP_ADDRESS_OF_EMAIL_APPLICATION:PORT
>>>
>>> written in the prometheus.yml file. Below is the alert response (array of
>>> objects) I am receiving from Prometheus:
>>>
>>>
>>> [
>>>   {
>>>     annotations: {
>>>       description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
>>>         'for more than 1 minutes.\n',
>>>       summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>>>     },
>>>     endsAt: '2025-02-03T06:33:31.893Z',
>>>     startsAt: '2025-02-03T06:28:31.893Z',
>>>     generatorURL: 'http://helo-container-pr:9091/graph?g0.expr=rabbitmq_queue_messages+%3E+1e%2B06&g0.tab=1',
>>>     labels: {
>>>       alertname: 'Total Messages > 10L in last 1 min',
>>>       instance: 'IP_ADDRESS:15692',
>>>       job: 'rabbitmq-rcs',
>>>       queue: 'QUEUE_NAME',
>>>       severity: 'critical',
>>>       vhost: 'webhook'
>>>     }
>>>   }
>>> ]
>>>
>>>
>>>
>>> If I keep evaluation_interval: 15s, it starts triggering every minute.
>>>
>>> I want alerts to trigger after 5 min and only if the condition is true.
>>> On Wednesday, March 5, 2025 at 2:18:34 PM UTC+5:30 Brian Candler wrote:
>>>
>>>> You still haven't shown an example of the actual alert you're concerned 
>>>> about (for example, the E-mail containing all the labels and the 
>>>> annotations)
>>>>
>>>> alertmanager cannot generate any alert unless Prometheus triggers it. 
>>>> Please go into the PromQL web interface, switch to the "Graph" tab with 
>>>> the 
>>>> default 1 hour time window (or less), and enter the following queries:
>>>>
>>>> up == 0
>>>> rabbitmq_queue_consumers == 0
>>>> rabbitmq_queue_messages > 10000
>>>>
>>>> Show the graphs.  If they are not blank, then alerts will be generated. 
>>>>
>>>> "*for: 30s" *has no effect when you have "*evaluation_interval: 5m".* I 
>>>> suggest you use *evaluation_internal: 15s* (to match your scrape 
>>>> internal), and then "for: 30s" will have some benefit; it will only send 
>>>> an 
>>>> alert if the alerting condition has been true for two successive cycles.
>>>>
>>>> On Wednesday, 5 March 2025 at 07:50:23 UTC Amol Nagotkar wrote:
>>>>
>>>>> Thank you for the reply.
>>>>>
>>>>>
>>>>> Answers to the points above:
>>>>>
>>>>> 1. I checked: the expression "up == 0" fires only rarely, and all my 
>>>>> targets are being scraped.
>>>>>
>>>>> 2. To avoid getting alerts every minute, I have now set 
>>>>> evaluation_interval to 5m.
>>>>>
>>>>> 3. I have removed keep_firing_for as it is not suitable for my use 
>>>>> case.
>>>>>
>>>>>
>>>>> Updated:
>>>>>
>>>>> I am using Prometheus alerting for RabbitMQ. Below is the 
>>>>> configuration I am using.
>>>>>
>>>>>
>>>>> *prometheus.yml file*
>>>>>
>>>>> global:
>>>>>   scrape_interval: 15s    # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>   evaluation_interval: 5m # Evaluate rules every 5 minutes. The default is every 1 minute.
>>>>>   # scrape_timeout is set to the global default (10s).
>>>>>
>>>>> alerting:
>>>>>   alertmanagers:
>>>>>     - static_configs:
>>>>>         - targets:
>>>>>             - ip:port
>>>>>
>>>>> rule_files:
>>>>>   - "alerts_rules.yml"
>>>>>
>>>>> scrape_configs:
>>>>>   - job_name: "prometheus"
>>>>>     static_configs:
>>>>>       - targets: ["ip:port"]
>>>>>
>>>>>
>>>>> *alerts_rules.yml file*
>>>>>
>>>>> groups:
>>>>> - name: instance_alerts
>>>>>   rules:
>>>>>   - alert: "Instance Down"
>>>>>     expr: up == 0
>>>>>     for: 30s
>>>>>     # keep_firing_for: 30s
>>>>>     labels:
>>>>>       severity: "Critical"
>>>>>     annotations:
>>>>>       summary: "Endpoint {{ $labels.instance }} down"
>>>>>       description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>
>>>>> - name: rabbitmq_alerts
>>>>>   rules:
>>>>>     - alert: "Consumer down for last 1 min"
>>>>>       expr: rabbitmq_queue_consumers == 0
>>>>>       for: 30s
>>>>>       # keep_firing_for: 30s
>>>>>       labels:
>>>>>         severity: Critical
>>>>>       annotations:
>>>>>         summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>         description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>
>>>>>     - alert: "Total Messages > 10k in last 1 min"
>>>>>       expr: rabbitmq_queue_messages > 10000
>>>>>       for: 30s
>>>>>       # keep_firing_for: 30s
>>>>>       labels:
>>>>>         severity: Critical
>>>>>       annotations:
>>>>>         summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>         description: |
>>>>>           Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>
>>>>>
>>>>> Even if there is no data in the queue, it sends me alerts. I have kept 
>>>>> evaluation_interval: 5m (Prometheus evaluates alert rules every 5 minutes) 
>>>>> and for: 30s (which ensures the alert fires only if the condition persists for 30s).
>>>>>
>>>>> I guess "for" is not working for me.
>>>>>
>>>>> By the way, I am not using Alertmanager
>>>>> (https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-0.28.0.linux-amd64.tar.gz).
>>>>>
>>>>> I am just using Prometheus
>>>>> (https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz).
>>>>>
>>>>> https://prometheus.io/download/
>>>>>
>>>>> How can I solve this? Thank you in advance.
>>>>>
>>>>> On Saturday, February 15, 2025 at 12:13:01 AM UTC+5:30 Brian Candler 
>>>>> wrote:
>>>>>
>>>>>> > even if application is not down, it sends alerts every 1 min. how 
>>>>>> to debug this i am using below exp:- alert: "Instance Down" expr: up == 0
>>>>>>
>>>>>> You need to show the actual alerts, from the Prometheus web interface 
>>>>>> and/or the notifications, and then describe how these are different from 
>>>>>> what you expect.
>>>>>>
>>>>>> I very much doubt that the expression "up == 0" is firing unless 
>>>>>> there is at least one target which is not being scraped, and therefore 
>>>>>> the 
>>>>>> "up" metric has a value of 0 for a particular timeseries (metric with a 
>>>>>> given set of labels).
>>>>>>
>>>>>> > if the threshold cross and value changes, it fires multiple alerts 
>>>>>> having same alert rule thats fine. But with same '{{ $value }}' it 
>>>>>> should 
>>>>>> fire alerts after 5 min. same alert rule with same value should not get 
>>>>>> fire for next 5 min. how to get this ??
>>>>>>
>>>>>> I cannot work out what problem you are trying to describe. As long as 
>>>>>> you only use '{{ $value }}' in annotations, not labels, then the same 
>>>>>> alert 
>>>>>> will just continue firing.
>>>>>>
>>>>>> Whether you get repeated *notifications* about that ongoing alert is 
>>>>>> a different matter. With "repeat_interval: 15m" you should get them 
>>>>>> every 
>>>>>> 15 minutes at least. You may get additional notifications if a new alert 
>>>>>> is 
>>>>>> added into the same alert group, or one is resolved from the alert group.
>>>>>>
>>>>>> > whats is for, keep_firing_for and evaluation_interval ?
>>>>>>
>>>>>> keep_firing_for is debouncing: once the alert condition has gone 
>>>>>> away, it will continue firing for this period of time. This is so that 
>>>>>> if 
>>>>>> the alert condition vanishes briefly but reappears, it doesn't cause the 
>>>>>> alert to be resolved and then retriggered.
>>>>>>
>>>>>> evaluation_interval is how often the alerting expression is evaluated.
>>>>>>
>>>>>>
>>>>>> On Friday, 14 February 2025 at 15:53:24 UTC Amol Nagotkar wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I want the same alert (alert rule) to fire again only after 5 min; 
>>>>>>> currently I am getting the same alert (alert rule) every minute for the 
>>>>>>> same '{{ $value }}'.
>>>>>>> If the threshold is crossed and the value changes, it fires multiple 
>>>>>>> alerts for the same alert rule, and that's fine. But with the same 
>>>>>>> '{{ $value }}' it should only fire again after 5 min; the same alert rule 
>>>>>>> with the same value should not fire for the next 5 min. How do I get this?
>>>>>>> Even if the application is not down, it sends alerts every 1 min. How do 
>>>>>>> I debug this? I am using the expression: alert: "Instance Down" expr: up == 0
>>>>>>> What are for, keep_firing_for and evaluation_interval?
>>>>>>> prometheus.yml
>>>>>>>
>>>>>>> global:
>>>>>>>   scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>>>   evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
>>>>>>>
>>>>>>> alerting:
>>>>>>>   alertmanagers:
>>>>>>>     - static_configs:
>>>>>>>         - targets:
>>>>>>>             - ip:port
>>>>>>>
>>>>>>> rule_files:
>>>>>>>
>>>>>>> - "alerts_rules.yml"
>>>>>>>
>>>>>>> scrape_configs:
>>>>>>>
>>>>>>> - job_name: "prometheus"
>>>>>>>   static_configs:
>>>>>>>   - targets: ["ip:port"]
>>>>>>>
>>>>>>> alertmanager.yml
>>>>>>> global:
>>>>>>>   resolve_timeout: 5m
>>>>>>> route:
>>>>>>>   group_wait: 5s
>>>>>>>   group_interval: 5m
>>>>>>>   repeat_interval: 15m
>>>>>>>   receiver: webhook_receiver
>>>>>>> receivers:
>>>>>>> - name: webhook_receiver
>>>>>>>   webhook_configs:
>>>>>>>   - url: 'http://ip:port'
>>>>>>>     send_resolved: false
>>>>>>>
>>>>>>> alerts_rules.yml
>>>>>>>
>>>>>>>
>>>>>>> groups:
>>>>>>> - name: instance_alerts
>>>>>>>   rules:
>>>>>>>   - alert: "Instance Down"
>>>>>>>     expr: up == 0
>>>>>>>     # for: 30s
>>>>>>>     # keep_firing_for: 30s
>>>>>>>     labels:
>>>>>>>       severity: "Critical"
>>>>>>>     annotations:
>>>>>>>       summary: "Endpoint {{ $labels.instance }} down"
>>>>>>>       description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>>>
>>>>>>> - name: rabbitmq_alerts
>>>>>>>   rules:
>>>>>>>     - alert: "Consumer down for last 1 min"
>>>>>>>       expr: rabbitmq_queue_consumers == 0
>>>>>>>       # for: 1m
>>>>>>>       # keep_firing_for: 30s
>>>>>>>       labels:
>>>>>>>         severity: Critical
>>>>>>>       annotations:
>>>>>>>         summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>>>         description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>>>
>>>>>>>
>>>>>>>     - alert: "Total Messages > 10k in last 1 min"
>>>>>>>       expr: rabbitmq_queue_messages > 10000
>>>>>>>       # for: 1m
>>>>>>>       # keep_firing_for: 30s
>>>>>>>       labels:
>>>>>>>         severity: Critical
>>>>>>>       annotations:
>>>>>>>         summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>>>         description: |
>>>>>>>           Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>>>
>>>>>>>
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>
