To summarize:

1. You're 100% positive that the alerting rule has
        expr: (blah) > 20

2. If you put "(blah) > 20" in the PromQL browser and switch to graph 
mode, then it's blank

3. But alerts are still firing

In that case, you need to go into the Prometheus web interface and click on 
"Alerts" at the top.  It will show you which alerts are currently firing, 
along with the triggering label sets and values.
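
If you prefer to stay in the query interface, you can also inspect the 
built-in ALERTS metric directly.  A minimal example, using the alert name 
from your ALERTS screenshot (adjust to taste):

    ALERTS{alertname="CPUSQLUtilizationWarning", alertstate="firing"}

That returns one time series per currently firing alert, carrying the full 
label set that triggered it.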

In short, it's impossible for the expression "(blah) > 20" to fire if that 
expression returns an empty instant vector.  So either it's *not* an empty 
instant vector, or else some other alert expression is firing.  You didn't 
show any details from your OpsGenie messages, so it is at least possible 
that a different alerting rule is causing the alerts.

You showed a graph of ALERTS{alertname="CPUSQLUtilizationWarning"}, but 
nothing binding that alert name to your alerting ruleset, since you didn't 
show the alert rule itself.

I believe you can have multiple alert rules with the same name.  Maybe 
there was a copy-paste slip when you duplicated an existing rule, so it's 
actually a different alert that is triggering under this name?
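
If I'm right about that, then a rules file like this sketch (not your 
config; "(something_else)" is just a placeholder) would be accepted, and 
either rule would show up in ALERTS under the same name:

    groups:
      - name: cpu
        rules:
          - alert: CPUSQLUtilizationWarning
            expr: (blah) > 20
            for: 3m
          - alert: CPUSQLUtilizationWarning
            expr: (something_else) > 60    # placeholder for a second, unrelated expression
            for: 3m

In that situation the second rule could be the one firing, even though the 
first rule's expression returns nothing.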

Finally: use promtool to check your config:

promtool check config /etc/prometheus/prometheus.yml
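
Since the suspect part is the rules rather than the main config, it may 
also be worth pointing promtool at the rule files themselves (adjust the 
path to wherever your rule files actually live):

promtool check rules /etc/prometheus/rules/*.yml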

On Sunday, 20 February 2022 at 19:42:09 UTC [email protected] 
wrote:

>
> I already attached screenshots with the rule and the actual query 
> results (the screenshot is without "> 20", because with it nothing shows). The 
> threshold is 20%, but the graph doesn't reach it; nonetheless, it causes an 
> alert. 
> On Sunday, February 20, 2022 at 6:56:44 PM UTC+2 Brian Candler wrote:
>
>> As far as I can see, you haven't shown your actual alerting rule.
>>
>> However, it's straightforward to debug this: paste your entire alerting 
>> "expr" into the PromQL query interface.  Wherever the graph shows any 
>> points, an alert will fire at that time.  You can then work backwards from 
>> there to find the problem with your expr.
>>
>> For example, say you have this rule:
>>     expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.8
>>
>> Paste exactly "avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.8" 
>> into the PromQL browser to see if and when it fires.
>>
>> In PromQL, the expression "foo" generates a vector: the set of all 
>> timeseries whose metric name is "foo".  Then "foo < 0.8" is a filter, not a 
>> boolean.  It filters the vector to only those whose value is less than 
>> 0.8.  When used as an alerting expression, you get an alert if the vector 
>> is not empty.
>>
>> On Sunday, 20 February 2022 at 16:38:10 UTC [email protected] 
>> wrote:
>>
>>> Hello everybody. 
>>> We are facing some issues with CPU monitoring.
>>> Our graphs never show the threshold being reached even once, let alone for 3m.
>>> All info and screenshots are below.
>>> The alert is configured to fire at 20%. This relates only to the blue graph.
>>>
>>> [image: Screenshot 2022-02-18 133538.png]
>>>
>>> [image: Screenshot 2022-02-18 133642.png]
>>>
>>> Prometheus creates a massive number of alerts in our Opsgenie; there are 
>>> no issues with other alerts, or even with a threshold of 60%.
>>> [image: Screenshot 2022-02-18 133820.png]
>>>
>>> Alert query:
>>>
>>> [image: Screenshot 2022-02-18 134142.png]
>>>
>>> Maybe you have some suggestions on what could cause this flapping and 
>>> trigger the alert? 
>>> We have already tried checking the graphs at 1, 2, 5 and 10 minute 
>>> resolution, by the hour, etc.; there is nothing that should result in an alert.
>>> Also, there are no such alerts from CloudWatch monitoring.
>>>
>>>
>>>
