Hi,
We have been using Prometheus (2.13.1) with one of our larger customer
installations for a while; thus far, it's been working great and we are
very thankful for the nice piece of software that it is. (We are a software
company ourselves, using Prometheus to monitor the health of both our own
application and many other relevant parts of the services involved.)
Because of the volume of data for some of our metrics, we have a number
of recording rules set up to make querying this data reasonable from
e.g. Grafana.
However, today we started seeing some really strange behavior after a
planned restart of one of the Tomcat-based application services we are
monitoring. Some requests *seem* to be peaking at 60s (indicating a problem
in our application backend), but the strange thing here is that our
recording rules produce very different results from just running the same
queries in the Prometheus console.
Here is how the recording rule has been defined in a
custom_recording_rules.yml file:
  - name: hbx_controller_action_global
    rules:
      - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
        expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
        expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
        expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
        expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
Querying global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
yields an output like this:
However, running the individual query gives a completely different view of
this data. Note how the 60-second peaks are completely gone in this
screenshot:
I don't really know what to make of this. Are we doing something
fundamentally wrong here in how our recording rules are set up, or could
this be a bug in Prometheus (unlikely)? Btw, we have the evaluation_interval
set to 15s globally.
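For context, the global setting mentioned above sits in the main
configuration roughly as follows. This is a minimal sketch rather than a
copy of our actual prometheus.yml; the relative rule_files path is an
assumption, and all other settings are omitted.

```yaml
# Sketch of the relevant prometheus.yml parts (not the full file).
global:
  # How often recording (and alerting) rules are evaluated.
  evaluation_interval: 15s

rule_files:
  # Assumed to live next to prometheus.yml; the real path may differ.
  - custom_recording_rules.yml
```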
Thanks in advance.
Best regards,
Per