Well, I can't give you concrete tips without seeing the labels, but generally you can use `label_join()` and `label_replace()` in PromQL to work around mismatching labels: <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_join>
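
For example, assuming the totals metric is called something like `windows_iis_worker_requests_total` and the mismatch is only that the errors metric carries extra labels such as `status_code` (I'm guessing at the exact names here, so treat this as an untested sketch), you can aggregate both sides down to the labels they share and then divide:

    sum by (fqdn, instance, app)
      (rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
    /
    sum by (fqdn, instance, app)
      (rate(windows_iis_worker_requests_total[5m]))
    > 0.1

That gives you the error ratio per fqdn/instance/app, so you could alert at > 0.1 for warning and > 0.3 for critical as you described. If the label *names* differ rather than just the set of labels, `label_replace()` can copy one side's label into the name the other side expects before dividing, e.g. `label_replace(some_metric, "app", "$1", "site", "(.*)")` (again, `app` and `site` are just placeholder names).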

[email protected] wrote on Wednesday, 21 October 2020 at 11:19:17 UTC+2:

> Sorry, to clarify, I guess by rate what I mean is the % of errors compared
> to total requests, i.e. if the error rate is more than 10% of total requests
> we could label it as a warning alert, if over 30% then a critical/outage
> (for example) - so yes, the ratio of errors to total requests!
>
> The label issue is, to quote my colleague: "the issue is one metric has
> different labels to the other. This means Prometheus can't match up the
> metrics as the labels don't match."
>
> I suppose we could strip the labels, but then we lose context like the
> status code, for example.
>
> On Wednesday, 21 October 2020 at 09:49:07 UTC+1 [email protected] wrote:
>
>> Hey again,
>>
>> do you mean by "rate of errors" the ratio between errors and the total
>> number of requests? If it is just the rate (as in the number of errors per
>> second) you can just replace `increase` with `rate`. This will give you the
>> errors per second averaged over the last 5 minutes.
>>
>> How does the label mismatch manifest itself? Is it just the label names
>> or do the values differ as well? Can you post the respective labels you're
>> interested in?
>>
>> [email protected] wrote on Wednesday, 21 October 2020 at 10:28:10 UTC+2:
>>
>>> # Calculates HTTP error responses total
>>> - record: windows:windows_iis_worker_request_errors_total:irate5m
>>>   expr: irate(windows_iis_worker_request_errors_total[5m])
>>>
>>> - alert: IIS error requests rate
>>>   expr: >
>>>     sum without ()
>>>     (rate(windows:windows_iis_worker_request_errors_total:irate5m{status_code!="401"}[5m]))
>>>     > 3
>>>   for: 5m
>>>   labels:
>>>     severity: critical
>>>     component: WindowsOS
>>>   annotations:
>>>     summary: "High IIS worker error rate"
>>>     description: "IIS http responses on {{ if $labels.fqdn }}{{ $labels.fqdn }}{{ else }}{{ $labels.instance }}{{ end }} for {{ $labels.app }} has a high rate of errors."
>>>     dashboard:
>>>     runbook:
>>>
>>> I'm trying to do something like this to alert when people are getting
>>> errors while trying to connect to a webapp. The issue is that the query
>>> 'windows_iis_worker_request_errors_total:irate5m' is returning non-integer
>>> values.
>>>
>>> The idea was to evaluate the number of errors over a rolling 5-minute
>>> window.
>>>
>>> Of course, in an ideal world I'd alert on the rate of errors by dividing
>>> by the total requests metric; however, the two metrics have a label
>>> mismatch and I am unsure how to perform that query.
>>>
>>> Would really appreciate any assistance!
>>>
>>> edit:
>>>
>>> Someone in the Prometheus developer group provided me with the following
>>> query, which does work:
>>>
>>> sum by (fqdn, instance, app)
>>> (increase(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
>>>
>>> However, I was wondering if someone would still know how to get a query
>>> working on the rate of errors rather than the increase in count, despite
>>> the label mismatch between the IIS total requests and IIS error requests
>>> metrics.

