Hi,

We have a set of high-cardinality metrics and are currently designing recording 
rules, primarily to improve dashboard performance.
Above a certain threshold, we observe group evaluation times exceeding the 
evaluation interval, which leads to missed iterations [1].
In these cases, we can also see that the next iteration starts at the end 
of the last evaluation plus the interval. So the iteration is not 
really skipped but rather delayed (the schedule lags behind).

What is the impact of this? Do we need to worry about iteration misses?

To be more concrete, here is one of our rule groups:

groups:
- name: http_server_requests_seconds_bucket
  rules:
  - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
    expr: sum by(app, method, uri, status, le) (rate(http_server_requests_seconds_bucket[1m]))
  - record: app_le:http_server_requests_seconds_bucket:rate1m
    expr: sum by(app, le) (app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m)

The scrape interval is set to 15s, the evaluation interval to 30s.
With ~3 million time series [2], we see evaluation times of ~1m.
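
For reference, a minimal sketch of alerting rules that would surface this condition. It assumes Prometheus scrapes its own metrics, so that prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds are available alongside the counter from [1]. The group name, alert names, and the 5m windows are illustrative choices, not part of our setup.

```yaml
groups:
- name: rule_group_health  # hypothetical group name
  rules:
  # Fires when a group's last evaluation took longer than its configured interval.
  # Both metrics carry the rule_group label, so the comparison matches per group.
  - alert: RuleGroupSlowerThanInterval
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
  # Fires when the missed-iterations counter has increased recently.
  - alert: RuleGroupIterationsMissed
    expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
```

With the numbers above (30s interval, ~1m evaluation time), the first rule would be expected to fire for the http_server_requests_seconds_bucket group.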

[1] We use "prometheus_rule_group_iterations_missed_total" to monitor 
iteration misses.
[2] We have a little test tool to simulate load on Prometheus before 
rolling this out. We're trying to find the limits of a single Prometheus 
instance before scaling horizontally (federation) or reaching for, e.g., 
Thanos or Cortex.
