Hello all, 

*TL;DR:* measuring `http_request_duration_seconds` on the query path is a 
poor proxy for query latency, since it does not account for data distribution 
or the number of samples/series touched by a query (both of which have 
significant implications for query performance). 

--- 

I'm exploring more granular performance metrics for Prometheus queries 
<https://github.com/thanos-io/thanos/issues/4895> downstream in Thanos 
(inspired by this discussion from Ian Billet 
<https://github.com/thanos-io/thanos/discussions/4674>), and wanted to reach 
out to the Prometheus developer community for ideas on how people measure 
and track query performance systematically. 

The aim is to create a new metric that captures these additional dimensions 
of a query, so that we can better understand and quantify query-performance 
SLIs in terms of the number of samples/series touched *before* a query is 
executed. 

The current solution I have arrived at is a crude n-dimensional histogram, 
where query duration is observed/bucketed with labels representing some 
scale (simplified to t-shirt sizes) of samples touched and series queried. 
This would allow me to query for query-duration quantiles over ranges 
of sample/series sizes (e.g. 90% of queries touching up to 1,000,000 samples 
and up to 10 series complete in less than 2s).

I would love to hear about other approaches members of the community have 
taken to capture this level of performance granularity in a metric (and to 
stir the pot with respect to the Thanos proposal).

Thanks,

Moad.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.