Now that Prometheus supports isolation, an evaluation taking longer than the
interval shouldn't affect query results anymore. But it will still cost you
memory and performance.
One possible solution: shard your rule evaluation by one of your labels, for
example by app. I've done this for a couple of our services that have similar
naming/cardinality issues.
groups:
- name: http_server_requests_seconds_bucket{app="foo"}
  rules:
  - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, method, uri, status, le) (
        rate(http_server_requests_seconds_bucket{app="foo"}[1m])
      )
  - record: app_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, le) (
        app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m{app="foo"}
      )
- name: http_server_requests_seconds_bucket{app="bar"}
  rules:
  - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, method, uri, status, le) (
        rate(http_server_requests_seconds_bucket{app="bar"}[1m])
      )
  - record: app_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, le) (
        app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m{app="bar"}
      )
Note: the name in the rule group is just an identifier and has no impact on
rule evaluation. It only needs to be unique per group.
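If you have more than a handful of app values, maintaining those near-identical
groups by hand gets tedious. A minimal sketch of generating the rule file from
a list of apps (the APPS list and the templating approach are my own
illustration, not something Prometheus provides):

```python
# Hypothetical generator: one rule group per app so each shard is
# evaluated independently. Adapt APPS to however you discover your
# app label values.

APPS = ["foo", "bar"]  # assumed app label values

# Template for one rule group; literal YAML braces are doubled for str.format.
GROUP_TEMPLATE = """\
- name: http_server_requests_seconds_bucket{{app="{app}"}}
  rules:
  - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, method, uri, status, le) (
        rate(http_server_requests_seconds_bucket{{app="{app}"}}[1m])
      )
  - record: app_le:http_server_requests_seconds_bucket:rate1m
    expr: |
      sum by (app, le) (
        app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m{{app="{app}"}}
      )
"""

def render_rules(apps):
    """Render a complete rule file with one sharded group per app."""
    groups = "".join(GROUP_TEMPLATE.format(app=app) for app in apps)
    return "groups:\n" + groups

if __name__ == "__main__":
    print(render_rules(APPS))
```

Regenerate and reload the file whenever the set of apps changes; each group
then evaluates on its own schedule instead of one monolithic group.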
On Mon, Nov 30, 2020 at 3:34 PM Julian Maicher <[email protected]> wrote:
> Hi,
>
> we have a set of high-cardinality metrics and currently design recording
> rules, primarily to improve dashboard performance.
> At a certain threshold, we observe group evaluation times exceeding the
> interval, thus leading to iteration misses [1].
> In these cases, we can also see that the next iteration starts at the end
> of the last evaluation plus the interval. So the iteration is not
> really skipped but rather delayed (the schedule has a lag).
>
> What is the impact of this? Do we need to worry about iteration misses?
>
> To be more concrete, here is one of our rule groups:
>
> groups:
> - name: http_server_requests_seconds_bucket
>   rules:
>   - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
>     expr: |
>       sum by (app, method, uri, status, le) (
>         rate(http_server_requests_seconds_bucket[1m])
>       )
>   - record: app_le:http_server_requests_seconds_bucket:rate1m
>     expr: |
>       sum by (app, le) (
>         app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
>       )
>
> The scrape interval is set to 15s, the evaluation interval to 30s.
> With ~3 million time series [2], we see evaluation times of ~1m.
>
> [1] We use "prometheus_rule_group_iterations_missed_total" to monitor
> iteration misses
> [2] We have a little test tool to simulate load on prometheus before
> rolling this out. We're trying to find limits of a single prometheus
> instance before scaling horizontally (federation) or reaching for e.g.,
> Thanos, Cortex.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/33cb8c8e-7d88-4f27-818f-2ddf0a4bab94n%40googlegroups.com
> .
>