Well, I can't give you concrete tips without seeing the labels, but generally you can use `label_join()` and `label_replace()` in PromQL to work around mismatching labels: <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_join>
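
For example, assuming the totals metric is called something like `windows_iis_worker_requests_total` and the mismatch is only that the errors metric carries extra labels such as `status_code` (I'm guessing at the exact names here, so treat this as an untested sketch), you can aggregate both sides down to the labels they share and then divide:

    sum by (fqdn, instance, app)
      (rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
    /
    sum by (fqdn, instance, app)
      (rate(windows_iis_worker_requests_total[5m]))
    > 0.1

That gives you the error ratio per fqdn/instance/app, so you could alert at > 0.1 for warning and > 0.3 for critical as you described. If the label *names* differ rather than just the set of labels, `label_replace()` can copy one side's label into the name the other side expects before dividing, e.g. `label_replace(some_metric, "app", "$1", "site", "(.*)")` (again, `app` and `site` are just placeholder names).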

[email protected] wrote on Wednesday, 21 October 2020 at 11:19:17 UTC+2:

> Sorry, to clarify, I guess by rate what I mean is the % of errors compared
> to total requests, i.e. if the error rate is more than 10% of total requests
> we could label it as a warning alert, if over 30% then a critical/outage
> (for example) - so yes, the ratio of errors to total requests!
>
> The label issue is, to quote my colleague: "the issue is one metric has
> different labels to the other. This means Prometheus can't match up the
> metrics as the labels don't match."
>
> I suppose we could strip the labels, but then we lose context like the
> status code, for example.
>
> On Wednesday, 21 October 2020 at 09:49:07 UTC+1 [email protected] wrote:
>
>> Hey again,
>>
>> do you mean by "rate of errors" the ratio between errors and the total
>> number of requests? If it is just the rate (as in the number of errors per
>> second) you can just replace `increase` with `rate`. This will give you the
>> errors per second averaged over the last 5 minutes.
>>
>> How does the label mismatch manifest itself? Is it just the label names
>> or do the values differ as well? Can you post the respective labels you're
>> interested in?
>>
>> [email protected] wrote on Wednesday, 21 October 2020 at 10:28:10 UTC+2:
>>
>>> # Calculates HTTP error responses total
>>> - record: windows:windows_iis_worker_request_errors_total:irate5m
>>>   expr: irate(windows_iis_worker_request_errors_total[5m])
>>>
>>> - alert: IIS error requests rate
>>>   expr: >
>>>     sum without ()
>>>     (rate(windows:windows_iis_worker_request_errors_total:irate5m{status_code!="401"}[5m]))
>>>     > 3
>>>   for: 5m
>>>   labels:
>>>     severity: critical
>>>     component: WindowsOS
>>>   annotations:
>>>     summary: "High IIS worker error rate"
>>>     description: "IIS http responses on {{ if $labels.fqdn }}{{ $labels.fqdn }}{{ else }}{{ $labels.instance }}{{ end }} for {{ $labels.app }} has a high rate of errors."
>>>     dashboard:
>>>     runbook:
>>>
>>> I'm trying to do something like this to alert when people are getting
>>> errors while trying to connect to a webapp. The issue is that the query
>>> 'windows_iis_worker_request_errors_total:irate5m' is returning non-integer
>>> values.
>>>
>>> The idea was to evaluate the number of errors over a rolling 5-minute
>>> window.
>>>
>>> Of course, in an ideal world I'd alert on the rate of errors by dividing
>>> by the total requests metric; however, the two metrics have a label
>>> mismatch and I am unsure how to perform that query.
>>>
>>> Would really appreciate any assistance!
>>>
>>> edit:
>>>
>>> Someone in the Prometheus developer group provided me with the following
>>> query, which does work:
>>>
>>> sum by (fqdn, instance, app)
>>> (increase(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
>>>
>>> However, I was wondering if someone would still know how to get a query
>>> working on the rate of errors rather than the increase in count, despite
>>> the label mismatch between the IIS total requests and IIS error requests
>>> metrics.

