Thanks for your response. It gave me some idea of how to approach such queries. By 500 I meant the number of requests, not the error code (my poor choice of wording did imply that). So I want to alert if 1000 requests take more than 3 seconds on average within a 5-minute interval. If I take http_request_duration_seconds_sum, for example,
avg_over_time( (http_request_duration_seconds_sum and ( http_request_duration_seconds_count >= 1000 ))[5m] ) > 3

Is this query right, or am I missing something?

On Tuesday, 28 April 2020 13:53:05 UTC+5:30, Brian Candler wrote:
>
> Do you mean requests with result status code 500?
>
> This is a bit tricky. First thing you have to be careful of is that
> "probe_http_duration_seconds" is not the total, it's broken down into
> phases, as you can see if you try the exporter with curl:
>
> $ curl 'localhost:9115/probe?module=http_2xx_example&target=https:%2f%2fwww.google.com'
> ...
> probe_duration_seconds 0.471663605
> ...
> probe_http_duration_seconds{phase="connect"} 0.010641254
> probe_http_duration_seconds{phase="processing"} 0.046997224
> probe_http_duration_seconds{phase="resolve"} 0.001434721
> probe_http_duration_seconds{phase="tls"} 0.421022725
> probe_http_duration_seconds{phase="transfer"} 0.001299392
> ...
> probe_http_status_code 200
>
> So really you should be using probe_duration_seconds which is the total
> time.
>
> Now, you can generate a filtered query like this:
>
> probe_duration_seconds and (probe_http_status_code == 500)
>
> The logical operators are described here
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
> and depend on the LHS and RHS having the same set of labels, unless you start
> doing grouping. This should return only LHS data points where the RHS has
> a data point.
>
> The trouble is, to do avg_over_time on that you'll need a subquery:
>
> avg_over_time( (probe_duration_seconds and (probe_http_status_code == 500))[5m:1m] ) > 3
>
> Subqueries will resample your data - in the above example it will take 1
> minute steps over 5 minutes. So you need to align this with whatever
> scraping rate you are using. It might be good enough.
>
> In general, a 500 error means your server is failing.
> If you're getting a noticeable number of 500 errors during a 5 minute
> period, you probably have bigger problems to worry about than the
> response time! That is you should fix the error, not worry about how
> long the error takes to be returned.
>

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/b2e25e55-9655-4e32-8e55-09f2dd9bbb3b%40googlegroups.com.
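
For comparison, here is a rough sketch of another way the condition in the question (average above 3 seconds over 5 minutes, with at least 1000 requests in that window) might be written. It assumes http_request_duration_seconds is a histogram or summary whose _sum and _count series are cumulative counters carrying the same labels. That assumption is not confirmed anywhere in this thread, and the thresholds are simply the ones from the question.

```promql
# Assumption: _sum and _count are counters with identical label sets.
# Average request duration over the last 5 minutes:
# seconds spent per second divided by requests per second.
(
    rate(http_request_duration_seconds_sum[5m])
  /
    rate(http_request_duration_seconds_count[5m])
) > 3
and
# Only keep series that served at least 1000 requests in the same window.
(
  increase(http_request_duration_seconds_count[5m]) >= 1000
)
```

Because _sum and _count only ever grow, comparing the raw _count value with 1000 tests the total since the process started rather than the last 5 minutes, which is why this sketch uses rate() and increase() over the window instead.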

