Hey all, I've got a problem I'm trying to tackle and I would appreciate any ideas or feedback.
I'm developing a set of recording rules that run calculations over various lookback durations (e.g. rate_2m, rate_30m, rate_1h, rate_6h, rate_12h, rate_1d, rate_3d):

    # Example for 1 hour lookback
    - record: my_rule:rate_1h
      expr: sum(rate(my_large_metric[1h]))

When the underlying metric has many labels and very high cardinality, the cost of re-aggregating the metric for every duration becomes significant. I'm trying to offset this cost by building the longer-duration rules from recording rules of a shorter duration, e.g.:

    # Aggregate + rate large metric
    - record: my_rule:rate_1h
      expr: sum(rate(my_large_metric[1h]))

    # Combine 1h samples together, avoiding cost of sum()
    - record: my_rule:rate_1d
      expr: avg_over_time(my_rule:rate_1h[24h:1h])

This leads to inaccuracy between the recording rule values and the equivalent rate()-based expression, because rate() misses increases that happen between the windows of consecutive evaluations (effectively the problem mentioned here: https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).

Is there a way to avoid the performance hit and maintain accuracy? I'm hoping that pre-aggregating the counter values by instance might do it:

    # Tracks the total count of events per scrape target
    - record: instance:my_rule:sum
      expr: sum by (instance)(my_large_metric)

    # Use this count of events to calculate rate of change over any
    # duration with greatly reduced aggregation costs
    - record: my_rule:rate_1h
      expr: sum(rate(instance:my_rule:sum[1h]))

    - record: my_rule:rate_1d
      expr: sum(rate(instance:my_rule:sum[1d]))

This does violate the principle outlined in https://www.robustperception.io/rate-then-sum-never-sum-then-rate, but I believe that aggregating per instance avoids the counter-reset problem, since all of an instance's counters reset together when it restarts.
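To show why I think per-instance aggregation is safe where a global sum is not, here is a small Python sketch. It is only a simplified model under my own assumptions: `increase` and `summed` are hypothetical helpers, the sample values are made up, and the model ignores the extrapolation and staleness handling that the real rate() performs.

```python
# Simplified model of reset-aware counter increase. This is NOT the
# Prometheus implementation: no extrapolation, no staleness handling.
def increase(samples):
    total = 0
    prev = samples[0]
    for value in samples[1:]:
        # A drop is treated as a reset: the counter restarted from zero,
        # so the whole new value counts as increase.
        total += value if value < prev else value - prev
        prev = value
    return total


def summed(series_list):
    # Element-wise sum of several series scraped at the same timestamps.
    return [sum(values) for values in zip(*series_list)]


# Made-up data: two series per instance, four scrapes each.
instance_a = [[10, 20, 30, 40], [5, 10, 15, 20]]
instance_b = [[100, 110, 5, 15], [50, 60, 3, 13]]  # restarts after scrape 2

# Reference: rate every raw series, then sum.
rate_then_sum = sum(increase(s) for s in instance_a + instance_b)
# Proposed: sum per instance, rate each instance series, then sum.
per_instance = increase(summed(instance_a)) + increase(summed(instance_b))
# Anti-pattern: sum everything, then rate.
global_sum = increase(summed(instance_a + instance_b))

print(rate_then_sum, per_instance, global_sum)  # prints: 93 93 123
```

In this toy case the per-instance variant matches the reference, while the global sum over-counts because the reset of instance B is applied to a value that still contains instance A's counters.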
Other potential issues I can see with this:

- Removing labeled series might trigger a counter reset
- Slightly increased risk of losing precision if summed values exceed 2^53 (beyond that, a float64 sample can no longer represent every integer)

Curious to know what people's thoughts are on this.
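P.S. To make the two concerns above concrete, here is a rough Python sketch. As before, this is a simplified model under my own assumptions (`increase` is a hypothetical helper, the numbers are made up, and extrapolation and staleness are ignored), not a description of what Prometheus does internally.

```python
# Simplified model of reset-aware counter increase (no extrapolation).
def increase(samples):
    total = 0
    prev = samples[0]
    for value in samples[1:]:
        # A drop is treated as a counter reset.
        total += value if value < prev else value - prev
        prev = value
    return total


# Made-up instance with two series; the second one disappears after
# the second scrape (e.g. its label set is no longer exported).
kept = [10, 20, 30, 40]
dropped = [100, 110]

# What sum by (instance) would record at each scrape.
instance_sum = [
    kept[i] + (dropped[i] if i < len(dropped) else 0)
    for i in range(len(kept))
]

true_increase = increase(kept) + increase(dropped)
recorded_increase = increase(instance_sum)
print(instance_sum, true_increase, recorded_increase)
# prints: [110, 130, 30, 40] 40 60

# float64 precision: above 2^53, adding 1 can be lost entirely.
precision_lost = float(2**53) + 1 == float(2**53)
print(precision_lost)  # prints: True
```

The disappearing series makes the summed value drop, which the model reads as a reset and turns into a spurious increase of 20 in this example.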

