Sorry, to clarify: by "rate" I mean the percentage of errors relative to total requests, i.e. if errors exceed 10% of total requests we could label it a warning alert, and if they exceed 30%, a critical/outage alert (for example). So yes, the ratio of errors to total requests!
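For illustration, the ratio and the two example thresholds above could be sketched as a pair of alerting rules. This is only a sketch under assumptions: the total-requests metric name `windows_iis_worker_requests_total`, the group name, and the alert names are not taken from this thread, and it assumes `fqdn`, `instance` and `app` exist on both metrics. Aggregating both sides down to those shared labels is one possible way around the label mismatch, at the cost of dropping `status_code` from the result.

```yaml
groups:
  - name: iis-error-ratio          # hypothetical group name
    rules:
      # Warning: more than 10% of requests over the last 5 minutes were errors.
      - alert: IISErrorRatioWarning   # hypothetical alert name
        expr: |
          sum by (fqdn, instance, app) (
            rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m])
          )
            /
          sum by (fqdn, instance, app) (
            # Assumed name of the total-requests metric; adjust to your exporter.
            rate(windows_iis_worker_requests_total[5m])
          )
            > 0.10
        for: 5m
        labels:
          severity: warning

      # Critical: more than 30% of requests over the last 5 minutes were errors.
      - alert: IISErrorRatioCritical  # hypothetical alert name
        expr: |
          sum by (fqdn, instance, app) (
            rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m])
          )
            /
          sum by (fqdn, instance, app) (
            rate(windows_iis_worker_requests_total[5m])
          )
            > 0.30
        for: 5m
        labels:
          severity: critical
```

Because both sides of the division carry exactly the same label set after `sum by (...)`, Prometheus can match them one-to-one. If the per-status-code breakdown is needed, it would have to come from a separate query or dashboard panel.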
The label issue is, to quote my colleague: "The issue is that one metric has different labels to the other. This means Prometheus can't match up the metrics, as the labels don't match." I suppose we could strip the labels, but then we lose context such as the status code.

On Wednesday, 21 October 2020 at 09:49:07 UTC+1 [email protected] wrote:

> Hey again,
>
> do you mean by "rate of errors" the ratio between errors and the total
> number of requests? If it is just the rate (as in the number of errors per
> second) you can just replace `increase` with `rate`. This will give you the
> errors per second averaged over the last 5 minutes.
>
> How does the label mismatch manifest itself? Is it just the label names or
> do the values differ as well? Can you post the respective labels of
> interest to you?
>
> [email protected] wrote on Wednesday, 21 October 2020 at 10:28:10
> UTC+2:
>
>> # Calculates HTTP error responses total
>> - record: windows:windows_iis_worker_request_errors_total:irate5m
>>   expr: irate(windows_iis_worker_request_errors_total[5m])
>>
>> - alert: IIS error requests rate
>>   expr:
>>     sum without ()
>>     (rate(windows:windows_iis_worker_request_errors_total:irate5m{status_code!="401"}[5m]))
>>     > 3
>>   for: 5m
>>   labels:
>>     severity: critical
>>     component: WindowsOS
>>   annotations:
>>     summary: "High IIS worker error rate"
>>     description:
>>       "IIS http responses on {{ if $labels.fqdn }}{{ $labels.fqdn }}{{ else }}{{
>>       $labels.instance }}{{ end }} for {{ $labels.app }} has high rate of errors."
>>     dashboard:
>>     runbook:
>>
>> I'm trying to do something like this to alert when people are getting
>> errors whilst trying to connect to a webapp. The issue is that the query itself,
>> 'windows_iis_worker_request_errors_total:irate5m', is returning non-integer
>> values.
>>
>> The idea was to evaluate the number of errors over a rolling 5 minute
>> window.
>>
>> Of course, in an ideal world I'd alert on the rate of errors by dividing by
>> the total requests metric. However, the two metrics have a label
>> mismatch and I am unsure how to perform that query.
>>
>> Would really appreciate any assistance!
>>
>> Edit:
>>
>> Someone in the Prometheus developer group provided me with the
>> following query, which does work:
>>
>> sum by (fqdn, instance, app)
>> (increase(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
>>
>> However, I was wondering if someone would still know how to get a query
>> working on the rate of errors rather than the increase in count, despite the
>> label mismatch between the IIS total requests and IIS error request metrics.
>>
>

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/57c9c354-f4c8-4a4f-b8d7-a78a9745da4fn%40googlegroups.com.

