Hey Bjorn,

Firstly, apologies for the slow response; I very much appreciate you taking 
the time to respond.

I like the overlapping evaluations idea; I will increase the evaluation 
interval and give this a try. I think the reduced precision at the start/end 
of an evaluation period would be acceptable in my case, and I like the idea 
of modifying the rule duration to account for this.

Cheers!


On Friday, 6 May 2022 at 02:49:14 UTC+10 [email protected] wrote:

> On 27.04.22 17:27, 'James Luck' via Prometheus Users wrote:
> > 
> > # Example for 1 hour lookback
> > record: my_rule:rate_1h 
> > expr: sum(rate(my_large_metric[1h]))
> > 
> > When the underlying metric has many labels and a very high cardinality, 
> > the cost of re-aggregating the metric becomes significant. I'm trying to 
> > offset this cost using an approach where I aggregate recording rules of 
> > a shorter duration over time, e.g.:
> > 
> > # Aggregate + rate large metric 
> > record: my_rule:rate_1h 
> > expr: sum(rate(my_large_metric[1h])) 
> > # Combine 1h samples together, avoiding cost of sum() 
> > record: my_rule:rate_1d 
> > expr: avg_over_time(my_rule:rate_1h[24h:1h])
> > 
> > Now this leads to inaccuracy between the recording rule values and the 
> > equivalent rate()-based expression due to the fact that rate will miss 
> > increases that happen between prior invocations (effectively the problem 
> > mentioned here: 
> > https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).
>
> I would just do `avg_over_time(my_rule:rate_1h[24h])`. It will include
> many "overlapping" evaluations in the average: instead of using 24
> data points from your recording rule, it will use as many as there are
> (depending on your rule evaluation interval). The performance impact
> should be manageable, and it avoids the problem of missing increases
> in between perfectly spaced ranges. It introduces another error,
> though: At the beginning and end of the 24h range, you get fewer
> overlapping evaluations, so the first and last 1h of the total range
> (which is actually 25h long, if you look at it precisely) are
> weighted gradually less than the rest. You can further reduce this
> error by having a large delta between the short and the long
> range. For example, if you have a 15s scrape interval and rule
> evaluation interval, you could record a `my_rule:rate_1m` without
> problem. Then the error in `avg_over_time(my_rule:rate_1m[1d])` will
> be very small.
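
Just to check my understanding, I think the first suggestion corresponds to a
rule group along these lines (the group name, the 15s interval and the rule
names are purely illustrative, not taken from my actual config):

groups:
  - name: my_rule_aggregation
    interval: 15s  # short evaluation interval, matching the 15s example above
    rules:
      # Cheap short-range rate, recorded frequently
      - record: my_rule:rate_1m
        expr: sum(rate(my_large_metric[1m]))
      # Plain range selector (no subquery): the average uses every
      # my_rule:rate_1m sample stored over the last 24h, so consecutive
      # 1m windows overlap heavily rather than being spaced exactly 1m apart
      - record: my_rule:rate_1d
        expr: avg_over_time(my_rule:rate_1m[1d])

If evaluating the 1d rule every 15s turns out to be too costly, I assume it
could live in a separate group with a longer interval.
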
>
> Another approach would be to keep the subquery and shorten the inner
> range by one evaluation interval. For example, assuming a 1m rule
> evaluation interval, you could do
> `avg_over_time(my_rule:rate_1h[24h:59m])`. As long as your rule has
> always been evaluated at the right point in time, this should be
> mathematically precise. However, data points might be missing, or the
> evaluation time might have a jitter, and there might even be weird
> things happening in the original time series you have calculated
> `my_rule:rate_1h` from. So very generally, I'd go with the first
> approach as the more robust one.
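
And the subquery variant, if I'm reading it right, would look something like
this (again assuming a 1m rule evaluation interval; names illustrative):

groups:
  - name: my_rule_aggregation_subquery
    interval: 1m
    rules:
      - record: my_rule:rate_1h
        expr: sum(rate(my_large_metric[1h]))
      # 59m = the 1h inner range shortened by one 1m evaluation interval,
      # as suggested above; precise only as long as the rule has always
      # been evaluated on schedule with no missing data points
      - record: my_rule:rate_1d
        expr: avg_over_time(my_rule:rate_1h[24h:59m])
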
>
> -- 
> Björn Rabenstein
> [PGP-ID] 0x851C3DA17D748D03
> [email] [email protected]
>
