Hi,

We have been using Prometheus (2.13.1) with one of our larger customer 
installations for a while; thus far, it's been working great and we are 
very thankful for the nice piece of software that it is. (We are a software 
company ourselves, using Prometheus to monitor the health of both our own 
application and many other relevant parts of the services involved.) 
Because of the volume of data for some of our metrics, we have a number 
of recording rules set up to make querying this data reasonable from 
e.g. Grafana.

However, today we started seeing some really strange behavior after a planned 
restart on one of the Tomcat-based application services we are monitoring. 
Some requests *seem* to be peaking at 60s (indicating a problem in our 
application backend), but the strange thing here is that our recording 
rules produce very different results than just running the same queries in 
the Prometheus console.

Here is how the recording rule has been defined in a 
custom_recording_rules.yml file:

  - name: hbx_controller_action_global
    rules:
      - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
        expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
        expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
        expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
        expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))

Querying global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m 
yields an output like this:


However, running the individual query gives a completely different view of 
this data. Note how the 60-second peaks are completely gone in this 
screenshot:


I don't really know what to make of this. Are we doing something 
fundamentally wrong here in how our recording rules are set up, or could 
this be a bug in Prometheus (unlikely)? Btw, we have the evaluation_interval 
set to 15s globally.
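
For reference, a minimal sketch of the relevant part of the main Prometheus 
configuration. It assumes the rules file sits next to the main config file 
(the relative path is an assumption), and all other settings, including the 
scrape configuration, are omitted:

```yaml
# Sketch only: just the settings mentioned in this message.
global:
  evaluation_interval: 15s   # how often recording rules are evaluated

rule_files:
  # Assumed location, relative to the main config file.
  - custom_recording_rules.yml
```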

Thanks in advance.

Best regards,
Per
