[prometheus-users] Re: struggling with alertmanager query

Kavan Mccanaan Wed, 21 Oct 2020 02:19:39 -0700

Sorry, to clarify: by "rate" I mean the percentage of errors relative to 
total requests, i.e. if errors exceed 10% of total requests we could raise 
a warning alert, and if they exceed 30%, a critical/outage alert (for 
example). So yes, the ratio of errors to total requests!


The label issue is, to quote my colleague: "The issue is one metric has 
different labels to the other. This means Prometheus can't match up the 
metrics, as the labels don't match."

I suppose we could strip the labels, but then we lose context such as the 
status code.
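
A minimal sketch of that ratio as Prometheus rules, under stated assumptions: `windows_iis_worker_requests_total` is assumed to be the total-requests counter, both metrics are assumed to carry the `fqdn`, `instance` and `app` labels, and the group, record and alert names are hypothetical. Aggregating both sides to the same label set with `sum by` is one way around the mismatch, at the cost of dropping `status_code` from the result.

```yaml
groups:
  - name: iis-error-ratio  # hypothetical group name
    rules:
      # Fraction of requests that ended in an error over the last 5 minutes.
      # Assumption: windows_iis_worker_requests_total is the total-requests
      # counter and shares fqdn, instance and app with the error counter.
      - record: app:windows_iis_worker_request_errors:ratio_rate5m
        expr: |
          sum by (fqdn, instance, app) (
            rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m])
          )
          /
          sum by (fqdn, instance, app) (
            rate(windows_iis_worker_requests_total[5m])
          )

      # Thresholds taken from the 10% / 30% example above.
      - alert: IISErrorRatioWarning
        expr: app:windows_iis_worker_request_errors:ratio_rate5m > 0.10
        for: 5m
        labels:
          severity: warning

      - alert: IISErrorRatioCritical
        expr: app:windows_iis_worker_request_errors:ratio_rate5m > 0.30
        for: 5m
        labels:
          severity: critical
```

If the status code should survive in the result, the left-hand side could instead be aggregated with `sum by (fqdn, instance, app, status_code)` and divided using `/ on (fqdn, instance, app) group_left`, which gives one ratio per status code. Both alerts fire above 30%, so an Alertmanager inhibition rule may be needed to suppress the warning.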
On Wednesday, 21 October 2020 at 09:49:07 UTC+1 [email protected] wrote:

> Hey again,
>
> do you mean by "rate of errors" the ratio between errors and the total 
> number of requests? If it is just the rate (as in the number of errors per 
> second) you can just replace `increase` with `rate`. This will give you the 
> errors per second averaged over the last 5 minutes.
>
> How does the label mismatch manifest itself? Is it just the label names or 
> do the values differ as well? Can you post the respective labels of 
> interest to you? 
>
> On Wednesday, 21 October 2020 at 10:28:10 UTC+2 [email protected] wrote:
>
>> # Calculates HTTP error responses total
>>   - record: windows:windows_iis_worker_request_errors_total:irate5m
>>     expr: irate(windows_iis_worker_request_errors_total[5m])
>>
>>   - alert: IIS error requests rate
>>     expr: |
>>       sum without () (
>>         rate(windows:windows_iis_worker_request_errors_total:irate5m{status_code!="401"}[5m])
>>       ) > 3
>>     for: 5m
>>     labels:
>>       severity: critical
>>       component: WindowsOS
>>     annotations:
>>       summary: "High IIS worker error rate"
>>       description: >-
>>         IIS HTTP responses on {{ if $labels.fqdn }}{{ $labels.fqdn }}{{ else }}{{ $labels.instance }}{{ end }}
>>         for {{ $labels.app }} have a high rate of errors.
>>       dashboard:
>>       runbook:
>>
>> I'm trying to do something like this to alert when people are getting 
>> errors while trying to connect to a web app. The issue is that the query 
>> itself, 'windows_iis_worker_request_errors_total:irate5m', is returning 
>> non-integer values.
>>
>> The idea was to evaluate over a rolling 5 minute window the number of 
>> errors.
>>
>> Of course, in an ideal world I'd alert on the rate of errors by dividing 
>> by the total requests metric; however, the two metrics have a label 
>> mismatch and I am unsure how to perform that query.
>>
>> Would really appreciate any assistance!
>>
>> edit:
>>
>> Someone in the Prometheus developer group provided me with the 
>> following query, which does work:
>>
>> sum by (fqdn, instance, app) 
>> (increase(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
>>
>> However I was wondering if someone would still know how to get a query 
>> working on the rate of errors rather than the increase in count despite the 
>> label mismatch between the IIS total requests and IIS error request metrics.
>>
>

