With
global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m,
there are more 60s spikes shown if I change to a 15s or 5s interval.
With the other query (histogram_quantile(0.99, sum by
(le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't
go above 1.2s, oddly enough.
On 2020-04-23 15:38, Julius Volz wrote:
Odd. Depending on time window alignment it can always be that some
spikes might appear in one graph and not another, but such a big
difference is strange. Just to make sure, what happens when you bring
down the resolution on both queries to 15s (which is your rule
evaluation interval) or lower?
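To illustrate the alignment effect mentioned above, here is a minimal Python sketch. It is not Prometheus code: the series, the timestamps and the `sample_at_step` helper are all hypothetical, and it only mimics the idea that a graph picks the most recent sample at or before each evaluation step. It is not meant as a diagnosis of the discrepancy in this thread.

```python
def sample_at_step(series, start, end, step):
    """Return the (timestamp, value) points a graph would plot for a given step.

    For each evaluation timestamp, take the most recent sample at or before it.
    This is a simplification; staleness and lookback handling are ignored.
    """
    points = []
    t = start
    while t <= end:
        candidates = [(ts, v) for ts, v in series if ts <= t]
        if candidates:
            points.append((t, candidates[-1][1]))
        t += step
    return points


# Hypothetical recorded series: one sample every 15s, with a single 60s spike at t=45.
series = [(ts, 60.0 if ts == 45 else 1.2) for ts in range(0, 301, 15)]

coarse = sample_at_step(series, start=0, end=300, step=60)
fine = sample_at_step(series, start=0, end=300, step=15)

print(max(v for _, v in coarse))  # 1.2: the spike falls between evaluation points
print(max(v for _, v in fine))    # 60.0: the spike is picked up
```

With a 60s step the only evaluation points are 0, 60, 120 and so on, so the sample at t=45 is never plotted; at a 15s step every sample is visible.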
On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:
Hi,
We have been using Prometheus (2.13.1) with one of our larger
customer installations for a while; thus far, it's been working
great and we are very thankful for the nice piece of software that
it is. (We are a software company ourselves, using Prometheus to
monitor the health of both our own application and many
other relevant parts of the services involved.) Because of the
volume of data for some of our metrics, we have a number of
recording rules set up, to make querying of this data reasonable
from e.g. Grafana.
However, today we started seeing some really strange behavior after a
planned restart of one of the Tomcat-based application services we
are monitoring. Some requests /seem/ to be peaking at 60s
(indicating a problem in our application backend), but the strange
thing here is that our recording rules produce very different
results than just running the same queries in the Prometheus console.
Here is how the recording rule has been defined in a
custom_recording_rules.yml file:
    - name: hbx_controller_action_global
      rules:
      - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
        expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
        expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
        expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
        expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
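For readers unfamiliar with how histogram_quantile turns bucket rates into a single number, the following is a simplified Python sketch of the linear interpolation described in the Prometheus documentation. The bucket boundaries and counts below are made up for illustration; the real buckets of hbx_controller_action_seconds are not shown in this thread, so whether it has a 60s bucket is purely an assumption here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, non-negative bucket bounds and
    a +Inf bucket that counts every observation.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The quantile lies beyond the last finite bound,
                # so that bound is reported instead of a real estimate.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical per-second bucket rates; 2% of requests exceed the last finite bound.
slow = [(0.5, 90), (1.0, 97), (5.0, 98), (60.0, 98), (math.inf, 100)]
# Same hypothetical layout, but every request finishes within 5s.
fast = [(0.5, 90), (1.0, 97), (5.0, 100), (60.0, 100), (math.inf, 100)]

print(histogram_quantile(0.99, slow))  # 60.0
print(histogram_quantile(0.99, fast))
```

Under these assumed buckets, a small share of observations in the +Inf bucket is enough to pin the 99th percentile at the highest finite bound, which is one way a flat value such as 60s can show up in a graph.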
Querying
global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
yields an output like this:
However, running the individual query gives a completely different
view of this data. Note how the 60-second peaks are completely
gone in this screenshot:
I don't really know what to make of this. Are we doing
something fundamentally wrong here in how our recording rules are
set up, or could this be a bug in Prometheus (unlikely)? Btw, we
have the evaluation_interval set to 15s globally.
Thanks in advance.
Best regards,
Per
--
You received this message because you are subscribed to the Google
Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/2d75ca0f-a24f-42e4-beb8-2ee88e04acdf%40googlegroups.com.