Hi,
we have a set of high-cardinality metrics and are currently designing
recording rules, primarily to improve dashboard performance.
At a certain threshold, we observe group evaluation times exceeding the
interval, thus leading to iteration misses [1].
In these cases, we can also see that the next iteration starts at the end
of the last evaluation plus the interval. So the iteration is not
really skipped but rather delayed (the schedule accumulates a lag).
What is the impact of this? Do we need to worry about iteration misses?
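
For context, this is roughly how we spot the lagging groups (a sketch,
not our exact query; both gauges are exposed by Prometheus itself):

  # groups whose last evaluation took longer than their configured interval
  prometheus_rule_group_last_duration_seconds
    > prometheus_rule_group_interval_seconds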
To be more concrete, here is one of our rule groups:
groups:
  - name: http_server_requests_seconds_bucket
    rules:
      - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
        expr: sum by(app, method, uri, status, le) (rate(http_server_requests_seconds_bucket[1m]))
      - record: app_le:http_server_requests_seconds_bucket:rate1m
        expr: sum by(app, le) (app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m)
The scrape interval is set to 15s, the evaluation interval to 30s.
With ~3 million time series [2], we see group evaluation times of
~1 minute.
[1] We use "prometheus_rule_group_iterations_missed_total" to monitor
iteration misses.
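Concretely, we alert on something along these lines (window and
"for" duration are illustrative, not tuned values):

  - alert: RuleGroupIterationsMissed
    expr: increase(prometheus_rule_group_iterations_missed_total[10m]) > 0
    for: 5m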
[2] We have a little test tool to simulate load on Prometheus before
rolling this out. We're trying to find the limits of a single
Prometheus instance before scaling horizontally (federation) or
reaching for e.g., Thanos or Cortex.