Hey Bjorn,

Firstly, apologies for the slow response; I very much appreciate you taking the time to reply.
I like the overlapping evaluations idea; I will increase the evaluation interval and give this a try (I've sketched what I have in mind below the quoted message). I think the reduced precision at the start/end of an evaluation period would be acceptable in my case, and I like the idea of modifying the rule duration to account for this. Cheers!

On Friday, 6 May 2022 at 02:49:14 UTC+10 [email protected] wrote:
> On 27.04.22 17:27, 'James Luck' via Prometheus Users wrote:
> >
> > # Example for 1 hour lookback
> > record: my_rule:rate_1h
> > expr: sum(rate(my_large_metric[1h]))
> >
> > When the underlying metric has many labels and a very high cardinality,
> > the cost of re-aggregating the metric becomes significant. I'm trying
> > to offset this cost using an approach where I aggregate recording rules
> > of a shorter duration over time, e.g.:
> >
> > # Aggregate + rate large metric
> > record: my_rule:rate_1h
> > expr: sum(rate(my_large_metric[1h]))
> > # Combine 1h samples together, avoiding cost of sum()
> > record: my_rule:rate_1d
> > expr: avg_over_time(my_rule:rate_1h[24h:1h])
> >
> > Now this leads to inaccuracy between the recording rule values and the
> > equivalent rate()-based expression, due to the fact that rate will miss
> > increases that happen between prior invocations (effectively the
> > problem mentioned here:
> > https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).
>
> I would just do `avg_over_time(my_rule:rate_1h[24h])`. It will include
> many "overlapping" evaluations into the average: instead of using 24
> data points from your recording rule, it will use as many as there are
> (depending on your rule evaluation interval). The performance impact
> should be manageable, and it avoids the problem of missing increases
> in between perfectly spaced ranges. It introduces another error,
> though: at the beginning and end of the 24h range, you get fewer
> overlapping evaluations, so the first and last 1h of the total range
> (which is actually 25h long, if you look at it precisely) is
> gradually less weighted than the rest. You can further reduce this
> error by having a large delta between the short and the long range.
> For example, if you have a 15s scrape interval and rule evaluation
> interval, you could record a `my_rule:rate_1m` without problem. Then
> the error in `avg_over_time(my_rule:rate_1m[1d])` will be very small.
>
> Another approach would be to keep the subquery and shorten the inner
> range by one evaluation interval. For example, assuming a 1m rule
> evaluation interval, you could do
> `avg_over_time(my_rule:rate_1h[24h:59m])`. As long as your rule has
> always been evaluated at the right point in time, this should be
> mathematically precise. However, data points might be missing, or the
> evaluation time might have a jitter, and there might even be weird
> things happening in the original time series you have calculated
> `my_rule:rate_1h` from. So very generally, I'd go with the first
> approach as the more robust one.
>
> --
> Björn Rabenstein
> [PGP-ID] 0x851C3DA17D748D03
> [email] [email protected]
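In case it is useful to anyone finding this thread later, this is roughly the rule group I am planning to try based on the first suggestion. The metric and rule names are just the placeholders from my earlier example (the group name is made up too), and the 15s scrape/evaluation interval is an assumption about my own setup rather than anything required:

groups:
  - name: my_large_metric_aggregation
    # Assumes a 15s scrape and rule evaluation interval.
    interval: 15s
    rules:
      # Short-range rule: pays the sum() cost over the high-cardinality
      # metric, but only over a 1m window.
      - record: my_rule:rate_1m
        expr: sum(rate(my_large_metric[1m]))
      # Long-range rule: averages the cheap, pre-aggregated series. With a
      # 15s evaluation interval the 24h range contains many overlapping
      # evaluations, so the under-weighting at the edges should stay small.
      - record: my_rule:rate_1d
        expr: avg_over_time(my_rule:rate_1m[24h])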

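For completeness, the subquery variant from the second suggestion would look roughly like this on my side, keeping the original 1h rule and assuming a 1m rule evaluation interval (placeholder names again):

# Aggregate + rate large metric, as in my original rules
record: my_rule:rate_1h
expr: sum(rate(my_large_metric[1h]))
# Subquery with a 59m inner step: the original 1h step shortened by one 1m
# evaluation interval. Precise if every evaluation runs exactly on schedule,
# but less robust to missing samples or evaluation jitter.
record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1h[24h:59m])

Given the robustness caveat above, I will start with the plain avg_over_time approach and only revisit this one if the edge weighting ends up mattering in practice.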
