Re: [prometheus-users] Extracting long queries from multiple histograms

Julius Volz Thu, 21 Apr 2022 10:21:21 -0700

On Wed, Apr 20, 2022 at 10:25 PM Victor Sudakov wrote:

> Victor Sudakov wrote:
> >
> > There is a web app which exports its metrics as multiple histograms,
> > one histogram per Web endpoint. So each set of histogram data is also
> > labelled by the {endpoint} label. There are about 50 endpoints so
> > about 50 histograms.
> >
> > I would like to detect and graph slow endpoints, that is I would like
> > to know the value of {endpoint} when its {le} is over 1s or something
> > like that.
> >
> > Can you please help with a relevant PromQL query and an idea how to
> > represent it in Grafana?
> >
> > I don't actually want 50 heatmaps, there must be a clever way to make
> > an overview of all the slow endpoints, or all the endpoints with a
> > particular status code etc.
>
> An example. The PromQL query
> `app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="GET"}`
> produces a histogram.
>
> The PromQL query
> `app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="POST"}`
> produces another histogram.
>
> The query `app1_response_duration_bucket{le="0.75"}` will return a
> list of endpoints which have responded faster than 0.75s.
>

This is not quite correct: this query gives you the le="0.75" bucket
counter for *all* endpoints, and the value of each bucket counter tells you
how many requests that endpoint has completed within 0.75s since the
exposing process started. It does not select the endpoints that were fast.

> How do I invert the "le" and find the endpoints slower than "le"?
>

Hmm, histograms are usually used to tell you about the *distribution* of
request latencies to a given endpoint (or other label combination). So it's
unclear what you mean by an endpoint being slower than some "le" value.
Do you want to find out whether some endpoint has handled any requests *at
all* that took longer than some duration? Or only if that happened in the
last X amount of time? Or only if a certain percentage of requests were too
slow?
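
Depending on which of these you mean, the bucket counters can be combined
roughly as follows. This is only a sketch: it assumes the histogram also
exposes the usual app1_response_duration_count series and that 0.75 is one
of its bucket boundaries (both inferred from your example, not verified).
The 1h window and the 5% threshold are arbitrary illustrations:

    # Endpoints with any request slower than 0.75s since process start:
    app1_response_duration_count
      - ignoring(le) app1_response_duration_bucket{le="0.75"} > 0

    # Endpoints with any request slower than 0.75s in the last hour:
    increase(app1_response_duration_count[1h])
      - ignoring(le) increase(app1_response_duration_bucket{le="0.75"}[1h]) > 0

    # Endpoints where more than 5% of requests in the last 5 minutes
    # took longer than 0.75s:
    1 - (
          rate(app1_response_duration_bucket{le="0.75"}[5m])
        / ignoring(le)
          rate(app1_response_duration_count[5m])
    ) > 0.05

The ignoring(le) modifier is needed because the bucket series carry an "le"
label that the _count series does not have. Note that increase()
extrapolates, so the second query can return fractional values.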

One thing people frequently do is to calculate percentiles / quantiles from
a histogram, for example:

    histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))

...would tell you the approximate 90th percentile latency in seconds,
averaged over a moving 5-minute window, for each label combination. You
can then combine this with a filter operator to find slow endpoints (e.g.
"... > 10" would give you those endpoints that have a 90th percentile
latency above 10s).
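
As a sketch of such a filter for your case, assuming the external_endpoint
label from your example is the one you want to group by, and using the 1s
threshold from your first mail:

    histogram_quantile(
      0.9,
      sum by (le, external_endpoint) (
        rate(app1_response_duration_bucket[5m])
      )
    ) > 1

Summing by (le, external_endpoint) merges the separate per-method and
per-status_code histograms into one histogram per endpoint before the
quantile is calculated. The "le" label has to be kept in the aggregation,
since histogram_quantile() needs it. The result has one series per slow
endpoint, which should fit into a single Grafana graph or table panel
rather than 50 heatmaps.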

See also https://prometheus.io/docs/practices/histograms/ for more details
on using histograms.

Regards,
Julius

-- 
Julius Volz
PromLabs - promlabs.com
