Hey all, I've got a problem I'm trying to tackle and I would appreciate any 
ideas or feedback.

I'm developing a set of recording rules that do different calculations over 
various durations (e.g. rate_2m, rate_30m, rate_1h, rate_6h, rate_12h, 
rate_1d, rate_3d):

# Example for 1 hour lookback
record: my_rule:rate_1h 
expr: sum(rate(my_large_metric[1h]))

When the underlying metric has many labels and a very high cardinality, the 
cost of re-aggregating the metric becomes significant. I'm trying to offset 
this cost using an approach where I aggregate recording rules of a shorter 
duration over time, e.g.:

# Aggregate + rate large metric 
record: my_rule:rate_1h 
expr: sum(rate(my_large_metric[1h])) 
# Combine 1h samples together, avoiding cost of sum() 
record: my_rule:rate_1d 
expr: avg_over_time(my_rule:rate_1h[24h:1h])

Now this introduces inaccuracy between the recording rule values and the 
equivalent rate()-based expression, because the rolled-up rule misses 
increases that fall between successive evaluations of the inner rule 
(effectively the problem mentioned 
here: 
https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).
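
To put a rough number on that drift, an ad-hoc comparison query along these 
lines (just a sketch, reusing the rule and metric names above; rate_1d shown, 
but the same applies to the other durations) should show how far the rollup 
deviates from the direct expression:

# Relative error of the rolled-up rule vs. the direct rate()-based expression 
(my_rule:rate_1d - sum(rate(my_large_metric[1d])))
  / sum(rate(my_large_metric[1d]))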

Is there a way to avoid the performance hit and maintain accuracy? I'm 
hoping that pre-aggregating the counter values by instance might do it:

# Tracks the total count of events per scrape target 
record: instance:my_rule:sum 
expr: sum by (instance)(my_large_metric) 
# Use this count of events to calculate the rate of change over any duration 
# with greatly reduced aggregation cost 
record: my_rule:rate_1h 
expr: sum(rate(instance:my_rule:sum[1h])) 
record: my_rule:rate_1d 
expr: sum(rate(instance:my_rule:sum[1d]))
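
For completeness, this is roughly how I'd lay the whole thing out as a rule 
group (the group name and evaluation interval below are placeholders, and the 
remaining durations follow the same pattern):

groups:
  - name: my_rule_rollups    # placeholder name
    interval: 1m             # placeholder evaluation interval
    rules:
      # Pre-aggregate the high-cardinality metric once, per scrape target
      - record: instance:my_rule:sum
        expr: sum by (instance) (my_large_metric)
      # Each duration variant is then a cheap rate() over the small series
      - record: my_rule:rate_2m
        expr: sum(rate(instance:my_rule:sum[2m]))
      - record: my_rule:rate_1h
        expr: sum(rate(instance:my_rule:sum[1h]))
      - record: my_rule:rate_1d
        expr: sum(rate(instance:my_rule:sum[1d]))
      - record: my_rule:rate_3d
        expr: sum(rate(instance:my_rule:sum[3d]))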

Now this does violate the principle outlined 
in https://www.robustperception.io/rate-then-sum-never-sum-then-rate, but I 
believe aggregating per instance avoids counter resets causing problems, 
since all of a target's series reset together when that target restarts. 
Other potential issues I can see with this:
- Removing a labelled series would cause the per-instance sum to drop, which 
rate() would treat as a counter reset
- Slightly increased risk of losing increments to float64 precision if the 
summed values exceed 2^53 (see the check sketched below)
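
For that second point, a guard along these lines could flag the problem 
early (just a sketch; the alert name and the 2^52 margin are arbitrary):

# Fire if the largest per-instance sum gets within a factor of two of 2^53 
alert: MyRuleSumNearingFloat64Limit 
expr: max(instance:my_rule:sum) > 2 ^ 52 
for: 1h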

Curious to hear people's thoughts on this.
