Thanks for your response. It gave me some idea of how to approach such 
queries. By 500 I meant the number of requests, not the error code (my 
poor choice of wording did imply that). So suppose I want to alert if 1000 
requests take more than 3 seconds on average within a 5-minute interval. If I 
take http_request_duration_seconds_sum as an example:

  avg_over_time( (http_request_duration_seconds_sum and (http_request_duration_seconds_count >= 1000))[5m] ) > 3

Is this query right, or am I missing something?

On Tuesday, 28 April 2020 13:53:05 UTC+5:30, Brian Candler wrote:
>
> Do you mean requests with result status code 500?
>
> This is a bit tricky.  First thing you have to be careful of is that 
> "probe_http_duration_seconds" is not the total, it's broken down into 
> phases, as you can see if you try the exporter with curl:
>
> $ curl 'localhost:9115/probe?module=http_2xx_example&target=https:%2f%2fwww.google.com'
> ...
> probe_duration_seconds 0.471663605
> ...
> probe_http_duration_seconds{phase="connect"} 0.010641254
> probe_http_duration_seconds{phase="processing"} 0.046997224
> probe_http_duration_seconds{phase="resolve"} 0.001434721
> probe_http_duration_seconds{phase="tls"} 0.421022725
> probe_http_duration_seconds{phase="transfer"} 0.001299392
> ...
> probe_http_status_code 200
>
> So really you should be using probe_duration_seconds which is the total 
> time.
>
> Now, you can generate a filtered query like this:
>
>     probe_duration_seconds and (probe_http_status_code == 500)
>
> The logical operators are described here 
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
>  and 
> depend on the LHS and RHS having the same set of labels, unless you start 
> doing grouping.  This should return only LHS data points where the RHS has 
> a data point.
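>
> As a rough sketch of the grouping case, assuming (hypothetically) that the 
> two series differed in some label other than instance and job, the match 
> could be restricted to those labels with on():
>
>     probe_duration_seconds and on(instance, job) (probe_http_status_code == 500)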
>
> The trouble is, to do avg_over_time on that you'll need a subquery:
>
>     avg_over_time( (probe_duration_seconds and (probe_http_status_code == 500))[5m:1m] ) > 3
>
> Subqueries will resample your data - in the above example it will take 1 
> minute steps over 5 minutes.  So you need to align this with whatever 
> scraping rate you are using.  It might be good enough.
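>
> For instance, assuming a 15 second scrape interval (an assumption here, 
> substitute your own), a step matching it might look like:
>
>     avg_over_time( (probe_duration_seconds and (probe_http_status_code == 500))[5m:15s] ) > 3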
>
> In general, a 500 error means your server is failing.  If you're getting a 
> noticeable number of 500 errors during a 5 minute period, you probably have 
> bigger problems to worry about than the response time!  That is, you should 
> fix the error, not worry about how long the error takes to be returned.
>
