Odd. Depending on how the time windows align, some spikes can always show up in one graph and not the other, but such a big difference is strange. Just to make sure: what happens when you bring the resolution of both queries down to 15s (your rule evaluation interval) or lower?
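To illustrate the alignment effect, here is a toy sketch in Python. It is not Prometheus code: the series, the spike position, and the helper names (`recorded_series`, `graph_points`) are made up for illustration, and it ignores the lookback limit that Prometheus applies. It assumes a recorded series with one sample per 15s evaluation and a graph that, at each step, shows the most recent sample at or before that timestamp.

```python
# Toy model (assumption, not Prometheus internals): a recorded series with one
# sample per rule evaluation and a single one-evaluation spike.
EVAL_INTERVAL = 15  # seconds, the evaluation_interval mentioned in the thread


def recorded_series(duration, spike_at, base=0.5, spike=60.0):
    """Return {timestamp: value} with one sample per evaluation interval."""
    return {
        t: (spike if t == spike_at else base)
        for t in range(0, duration, EVAL_INTERVAL)
    }


def graph_points(series, start, end, step):
    """At each step timestamp, take the most recent sample at or before it."""
    points = []
    for t in range(start, end + 1, step):
        older = [ts for ts in series if ts <= t]
        if older:
            points.append((t, series[max(older)]))
    return points


series = recorded_series(duration=300, spike_at=135)

# Same data, three different graph settings.
coarse = graph_points(series, 0, 285, step=60)    # steps land on 0, 60, 120, ...
shifted = graph_points(series, 15, 285, step=60)  # steps land on 15, 75, 135, ...
fine = graph_points(series, 0, 285, step=15)      # one step per evaluation

print("step=60s, start=0 :", max(v for _, v in coarse))
print("step=60s, start=15:", max(v for _, v in shifted))
print("step=15s          :", max(v for _, v in fine))
```

With a 60s step the spike at t=135 is visible only when a step happens to land on it, while a 15s step always shows it. That is why comparing both queries at a resolution no coarser than the evaluation interval rules out alignment as the cause.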

On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:

> Hi,
>
> We have been using Prometheus (2.13.1) with one of our larger customer
> installations for a while; thus far, it's been working great and we are
> very thankful for the nice piece of software that it is. (We are a software
> company ourselves, using Prometheus to monitor the health of both our own
> application and many other relevant parts of the services involved.)
> Because of the volume of data for some of our metrics, we have a number
> of recording rules set up, to make querying of this data reasonable from
> e.g. Grafana.
>
> However, today we started seeing some really strange behavior after a
> planned restart of one of the Tomcat-based application services we are
> monitoring. Some requests *seem* to be peaking at 60s (indicating a problem
> in our application backend), but the strange thing here is that our
> recording rules produce very different results than just running the same
> queries in the Prometheus console.
>
> Here is how the recording rule has been defined in a
> custom_recording_rules.yml file:
>
>   - name: hbx_controller_action_global
>     rules:
>     - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>       expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>     - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>       expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>     - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>       expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>     - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>       expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>
> Querying
> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
> yields an output like this:
>
>
> However, running the individual query gives a completely different view of
> this data. Note how the 60-second peaks are completely gone in this
> screenshot:
>
>
> I don't really know what to make of this. Are we doing something
> fundamentally wrong here in how our recording rules are set up, or could
> this be a bug in Prometheus (unlikely)? Btw, we have the
> evaluation_interval set to 15s globally.
>
> Thanks in advance.
>
> Best regards,
> Per
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/2d75ca0f-a24f-42e4-beb8-2ee88e04acdf%40googlegroups.com.

