Re: [prometheus-users] Recording rule displaying different results than ad-hoc querying

Julius Volz Fri, 15 May 2020 03:15:24 -0700

Yeah, that would be great, thanks!

On Fri, May 15, 2020 at 12:11 PM Per Lundberg <[email protected]> wrote:


> Interesting, Marcin. It could indeed be the same root cause.
>
> Julius - I tried a more recent Prometheus version, 2.18.1 (by synchronizing
> 200 GiB of data to my local machine ;), and I still get the same behavior.
>
> I get the feeling that this could be a bug in Prometheus. Should I perhaps
> report it via the GitHub issue tracker?
>
> Best regards,
> Per
>
> On Wednesday, May 13, 2020 at 12:09:43 PM UTC+3, Marcin Chmiel wrote:
>>
>> We're facing what I believe is the exact same issue, on v2.16.0, although
>> we also have some intermittent failures with kube-state-metrics, which
>> generates the data for this query. I reckon the recording rule should either
>> be empty (due to time skew) or show the same value as the query, but not
>> show such a dip.
>>
>> Here's the query that's plotted, on which the recording rule is based:
>>
>> count by (namespace) (kube_namespace_labels{label_xxx="123"})
>>
>> The absent series shows periods where kube-state-metrics is unavailable.
>> The orange color marks where the query and the recording rule overlap.
>>
>> [image: grafana.png]
>>
>>
>> On Thursday, 23 April 2020 15:24:42 UTC+2, Julius Volz wrote:
>>>
>>> Strange! Have you tried a more recent Prometheus version, btw? Just to
>>> rule that part out, since 2.13.1 is pretty old...
>>>
>>> On Thu, Apr 23, 2020 at 3:02 PM Per Lundberg <[email protected]> wrote:
>>>
>>>> With
>>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m, there
>>>> are more 60s spikes shown if I change to a 15s or 5s interval. With the
>>>> other query (histogram_quantile(0.99, sum by
>>>> (le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't go
>>>> above 1.2s, oddly enough.
>>>> On 2020-04-23 15:38, Julius Volz wrote:
>>>>
>>>> Odd. Depending on time window alignment, it can always happen that some
>>>> spikes appear in one graph and not in another, but such a big difference
>>>> is strange. Just to make sure, what happens when you bring down the
>>>> resolution on both queries to 15s (which is your rule evaluation interval)
>>>> or lower?
>>>>
>>>> On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have been using Prometheus (2.13.1) with one of our larger customer
>>>>> installations for a while; thus far, it's been working great and we are
>>>>> very thankful for the nice piece of software that it is. (We are a
>>>>> software company ourselves, using Prometheus to monitor the health of
>>>>> both our own application and many other relevant parts of the services
>>>>> involved.) Because of the volume of data for some of our metrics, we
>>>>> have a number of recording rules set up to make querying this data
>>>>> reasonable from e.g. Grafana.
>>>>>
>>>>> However, today we started seeing some really strange behavior after a
>>>>> planned restart of one of the Tomcat-based application services we are
>>>>> monitoring. Some requests *seem* to be peaking at 60s (indicating a
>>>>> problem in our application backend), but the strange thing is that our
>>>>> recording rules produce very different results than running the same
>>>>> queries in the Prometheus console.
>>>>>
>>>>> Here is how the recording rule has been defined in a
>>>>> custom_recording_rules.yml file:
>>>>>
>>>>>   - name: hbx_controller_action_global
>>>>>     rules:
>>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>>>>>         expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>>>>>         expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>>>>>         expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>>>         expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>>
>>>>> Querying
>>>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>>> yields an output like this:
>>>>>
>>>>>
>>>>> However, running the individual query gives a completely different
>>>>> view of this data. Note how the 60-second peaks are completely gone in
>>>>> this screenshot:
>>>>>
>>>>>
>>>>> I don't really know what to make of this. Are we doing something
>>>>> fundamentally wrong in how our recording rules are set up, or could
>>>>> this be a bug in Prometheus (unlikely)? Btw, we have the
>>>>> evaluation_interval set to 15s globally.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Best regards,
>>>>> Per
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/2d75ca0f-a24f-42e4-beb8-2ee88e04acdf%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/prometheus-users/2d75ca0f-a24f-42e4-beb8-2ee88e04acdf%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
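
One way to act on the suggestion above (comparing both queries at the 15s rule evaluation interval) is to request the recorded series and the ad-hoc expression over the same window and step through the Prometheus HTTP API, so the two results can be lined up point by point. The following is only a minimal sketch that builds the two `/api/v1/query_range` URLs. The base URL, the helper names, and the example timestamps are assumptions made for illustration and do not come from the thread; the metric and rule names are the ones quoted above.

```python
from urllib.parse import urlencode

# Assumption: a Prometheus server reachable at this address (hypothetical).
BASE_URL = "http://localhost:9090"

# Recorded series and the expression it is derived from, as quoted in the thread.
RECORDED = "global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m"
AD_HOC = (
    "histogram_quantile(0.99, sum by (le)"
    "(rate(hbx_controller_action_seconds_bucket[1m])))"
)


def range_query_url(expr, start, end, step="15s", base_url=BASE_URL):
    """Build a /api/v1/query_range URL for expr over [start, end]."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def comparison_urls(start, end, step="15s"):
    """Return URLs for the recorded series and the ad-hoc expression.

    Both share the same window and step, so their samples fall on the
    same timestamps and can be compared directly.
    """
    return {
        "recorded": range_query_url(RECORDED, start, end, step),
        "ad_hoc": range_query_url(AD_HOC, start, end, step),
    }


if __name__ == "__main__":
    # Hypothetical one-hour window, given as Unix timestamps.
    for name, url in comparison_urls(1587636000, 1587639600).items():
        print(name, url)
```

Fetching both URLs (for example with curl) and comparing the returned values per timestamp should show whether the difference persists when both series are sampled at the same resolution, or whether it only shows up at coarser graph steps.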

