Hello all, *TL;DR:* measuring `http_request_duration_seconds` on the query path is a poor proxy for query latency, since it does not account for data distribution or the number of samples/series touched by a query (both of which have significant implications for query performance).
---

I'm exploring more granular performance metrics for Prometheus queries <https://github.com/thanos-io/thanos/issues/4895> downstream in Thanos (inspired by this discussion from Ian Billet <https://github.com/thanos-io/thanos/discussions/4674>), and wanted to reach out to the Prometheus developer community for ideas on how people measure and track query performance systematically. The aim is to create a new metric that captures these additional dimensions, to better understand and quantify query-performance SLIs in terms of the number of samples/series touched *before* a query is executed.

The current solution I have arrived at is a crude n-dimensional histogram, where query duration is observed and bucketed with labels representing some scale (simplified to t-shirt sizes) of samples touched and series queried. This would allow me to query for query-duration quantiles over ranges of sample/series sizes (e.g. 90% of queries touching up to 1,000,000 samples and up to 10 series complete in less than 2s).

I would love to hear about other approaches members of the community have taken for capturing this level of performance granularity in a metric (as well as to stir the pot wrt the Thanos proposal).

Thanks,
Moad

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/8f2c3a95-1ba1-416c-a49b-97c39538f6e3n%40googlegroups.com.
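P.S. To make the t-shirt-size idea above concrete, here is a minimal Go sketch of the bucketing step. The thresholds and the label names `samples_scale`/`series_scale` are my own hypothetical choices, not anything from an existing Prometheus or Thanos API; in practice the two labels would go on a `prometheus.HistogramVec` that observes query duration.

```go
package main

import "fmt"

// sizeBucket maps a raw count (samples or series touched by a query) to a
// coarse "t-shirt size" label, keeping the label cardinality low enough to
// be safe as a histogram dimension.
// NOTE: these thresholds are illustrative assumptions only.
func sizeBucket(n uint64) string {
	switch {
	case n <= 10_000:
		return "small"
	case n <= 1_000_000:
		return "medium"
	case n <= 100_000_000:
		return "large"
	default:
		return "xlarge"
	}
}

func main() {
	// A query that touched ~2M samples across 8 series would land in the
	// ("large", "small") cell of the histogram.
	samples, series := uint64(2_000_000), uint64(8)
	fmt.Printf("query_duration_seconds{samples_scale=%q, series_scale=%q}\n",
		sizeBucket(samples), sizeBucket(series))
}
```

The payoff is that a PromQL `histogram_quantile` over this metric, filtered by the two scale labels, answers questions like "p90 latency for queries of up to 1M samples and up to 10 series" directly.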

