Hi,

Could it be that when graphing the CPU usage, the graph resolution was just low and thus it might have skipped over a short spike in the rate?
Try:

    max_over_time(cpu:usage[3d])

...or something like that, to make sure that you are really looking at all samples within a given time range, not just a subset depending on the graph resolution.

Not sure though why the 70% one wouldn't have fired if the 90% one did, assuming the alerts were in the same rule group with the same intervals (and thus evaluation timestamps).

Btw., you most likely want to have some "for" duration on that alert to make it less sensitive, and/or use rate() instead of irate() in the underlying recording rule, to actually look at 5m worth of CPU usage rather than just the last two samples of the 5m window.

Regards,
Julius

On Thu, Jul 8, 2021 at 7:48 PM James S <[email protected]> wrote:

> I have the same issue with the CPU usage of one node: Prometheus is firing
> a false positive.
> I checked in GCP monitoring, and CPU usage is not over 80%.
>
> On Tuesday, October 2, 2018 at 10:14:38 AM UTC-4 [email protected] wrote:
>
>> Hi,
>>
>> I have 2 cpu usage alerts set up in prometheus:
>>
>> alert: Cpu_Usage_Greater_Than_70_Pct
>> expr: cpu:usage > 70
>> labels:
>>   severity: warning
>> annotations:
>>   description: CPU Usage on these nodes is greater than 70 pct (over 5m)
>>   severity: warning
>>   summary: 'WARNING: CPU Usage is greater than 70 pct'
>>
>> alert: Cpu_Usage_Greater_Than_90_Pct
>> expr: cpu:usage > 90
>> labels:
>>   severity: danger
>> annotations:
>>   description: CPU Usage on these nodes is greater than 90 pct (over 5m)
>>   severity: danger
>>   summary: 'DANGER: CPU Usage is greater than 90 pct'
>>
>> Where cpu:usage is defined as:
>>
>> File: recording_rules.yml; Group name: Cpu Usage Percentage (over 5m)
>> -------
>> record: cpu:usage
>> expr: 100 * (1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) BY (instance, job)))
>>
>> This morning, the "*cpu usage greater than 90 pct*" alert fired (and was
>> sent to AlertManager that emailed several people), but the 70% one did not
>> fire. Upon further investigation of Prometheus DB (via /graph GUI), I see
>> that cpu% was never greater than even 40% on any node for several days.
>> This seems to be a false positive alarm.
>>
>> Is there a way for me to debug the root cause?

--
Julius Volz
PromLabs - promlabs.com

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAObpH5ysGAijuXJB0Q-3-DOH64sLrbfGUR7%2BZ2omCZz8cWZ2Lw%40mail.gmail.com.
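
P.S. For illustration, here is a rough sketch of what the two suggestions from the reply (a "for" duration on the alerts, and rate() instead of irate() in the recording rule) might look like in a rule file. The rule names, thresholds, and labels are taken from the original post; the group names and the 10m "for" value are assumptions picked only for the example, so adjust them to your setup.

```yaml
groups:
  # Group names here are illustrative, not taken from the original setup.
  - name: cpu_usage_recording
    rules:
      - record: cpu:usage
        # rate() averages over the whole 5m window, whereas irate() only
        # looks at the last two samples in that window.
        expr: 100 * (1 - avg by (instance, job) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

  - name: cpu_usage_alerts
    rules:
      - alert: Cpu_Usage_Greater_Than_70_Pct
        expr: cpu:usage > 70
        # Assumed value: the condition has to hold for 10m before firing.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'WARNING: CPU Usage is greater than 70 pct'

      - alert: Cpu_Usage_Greater_Than_90_Pct
        expr: cpu:usage > 90
        for: 10m
        labels:
          severity: danger
        annotations:
          summary: 'DANGER: CPU Usage is greater than 90 pct'
```

With a "for" clause, a single evaluation above the threshold only puts the alert into the pending state; it fires only if the condition keeps holding for the whole duration, which should filter out one-off spikes like the one suspected here.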

