Re: [prometheus-users] Re: How to debug possible false positive alarm?

Julius Volz Sat, 10 Jul 2021 14:41:08 -0700

Hi,

Could it be that when graphing the CPU usage, the graph resolution was just
low and thus it might have skipped over a short spike in the rate?


Try:

   max_over_time(cpu:usage[3d])

...or something like that to make sure that you are really looking at all
samples within a given time range, not just a subset depending on the graph
resolution.
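
As a rough sketch of that check (the 90 threshold is taken from the alert
quoted below, and the 3d window is just an example), you could run
something like this in the console view:

   max_over_time(cpu:usage[3d]) > 90

   count_over_time(ALERTS{alertname="Cpu_Usage_Greater_Than_90_Pct"}[3d])

The first query should return only the series whose highest recorded value
crossed the threshold. The second counts the ALERTS samples written while
the alert was pending or firing, and returns nothing if it never was.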

Not sure, though, why the 70% alert wouldn't have fired if the 90% one did,
assuming both alerts were in the same rule group with the same interval (and
thus the same evaluation timestamps).
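
For illustration, a minimal sketch of what "same rule group" would look
like. The group name and interval here are my assumptions, not taken from
your config:

   groups:
     - name: cpu_usage_alerts   # hypothetical group name
       interval: 1m             # assumed; if omitted, the global evaluation_interval applies
       rules:
         - alert: Cpu_Usage_Greater_Than_70_Pct
           expr: cpu:usage > 70
         - alert: Cpu_Usage_Greater_Than_90_Pct
           expr: cpu:usage > 90

Rules in one group are evaluated in order at the same evaluation timestamp,
so both expressions would be evaluated against the same instant.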

By the way, you most likely want a "for" duration on that alert to make it
less sensitive, and/or to use rate() instead of irate() in the underlying
recording rule, so that it actually looks at 5m worth of CPU usage rather
than just the last two samples in the 5m window.
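
Roughly like this, as a sketch only. The 5m "for" value is an assumption to
tune, and the two fragments belong in whichever rule groups you already
use:

   # Recording rule: rate() averages over the whole 5m window,
   # irate() would only use the last two samples in it.
   - record: cpu:usage
     expr: 100 * (1 - avg by (instance, job) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

   # Alerting rule: the condition has to hold for 5m before the alert fires.
   - alert: Cpu_Usage_Greater_Than_90_Pct
     expr: cpu:usage > 90
     for: 5m   # assumed duration
     labels:
       severity: danger

With a "for" clause the alert first goes to pending, and only fires if the
expression keeps returning a result for the whole duration.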

Regards,
Julius

On Thu, Jul 8, 2021 at 7:48 PM James S <[email protected]> wrote:

> I have the same issue with CPU usage on one node: Prometheus is firing a
> false positive.
> I checked GCP monitoring, and CPU usage there is not over 80%.
>
> On Tuesday, October 2, 2018 at 10:14:38 AM UTC-4 [email protected]
> wrote:
>
>> Hi,
>>
>> I have 2 cpu usage alerts set up in prometheus:
>>
>> alert: Cpu_Usage_Greater_Than_70_Pct
>> expr: cpu:usage > 70
>> labels:
>>   severity: warning
>> annotations:
>>   description: CPU Usage on these nodes is greater than 70 pct (over 5m)
>>   severity: warning
>>   summary: 'WARNING: CPU Usage is greater than 70 pct'
>>
>>
>> alert: Cpu_Usage_Greater_Than_90_Pct
>> expr: cpu:usage > 90
>> labels:
>>   severity: danger
>> annotations:
>>   description: CPU Usage on these nodes is greater than 90 pct (over 5m)
>>   severity: danger
>>   summary: 'DANGER: CPU Usage is greater than 90 pct'
>>
>>
>>
>> Where cpu:usage is defined as:
>>
>> File: recording_rules.yml; Group name: Cpu Usage Percentage (over 5m)
>> -------
>> record: cpu:usage
>> expr: 100
>>   * (1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) BY (instance,
>>   job)))
>>
>>
>>
>>
>> This morning, the "*cpu usage greater than 90 pct*" alert fired (and was
>> sent to AlertManager that emailed several people), but the 70% one did not
>> fire.  Upon further investigation of Prometheus DB (via /graph GUI), I see
>> that cpu% was never even greater than 40% on any node for several days.
>> This seems to be a false positive alarm.
>>
>> Is there a way for me to debug the root cause?
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/4fd6c42e-c807-4f99-ab25-d0ab1faa1267n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/4fd6c42e-c807-4f99-ab25-d0ab1faa1267n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>


-- 
Julius Volz
PromLabs - promlabs.com
