We're facing what I believe is the exact same issue, on v2.16.0, although we
also have some intermittent failures with kube-state-metrics, which generates
the data for this query. I reckon the recording rule should either be empty
(due to time skew) or show the same value as the query, but not have such a
dip.
Here's the query that's plotted, on which the recording rule is based:

count by (namespace) (kube_namespace_labels{label_xxx="123"})

The absent series shows the periods where kube-state-metrics was unavailable;
the orange color is where the query and the recording rule overlap.
[image: grafana.png]
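
For reference, a minimal sketch of what the recording rule for that query
might look like (the group and record names below are assumptions; only the
expr comes from the query above):

groups:
- name: kube_namespace_labels
  rules:
  - record: namespace:kube_namespace_labels:count
    expr: count by (namespace) (kube_namespace_labels{label_xxx="123"})

The expectation described above is that this recorded series tracks the raw
count, apart from being empty while kube-state-metrics is down.
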
On Thursday, 23 April 2020 15:24:42 UTC+2, Julius Volz wrote:
>
> Strange! Have you tried a more recent Prometheus version, btw.? Just to
> rule that part out, since 2.13.1 is pretty old...
>
> On Thu, Apr 23, 2020 at 3:02 PM Per Lundberg <[email protected]> wrote:
>
>> With global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m,
>> there are more 60s spikes shown if I change to a 15s or 5s interval. With
>> the other query (histogram_quantile(0.99, sum by
>> (le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't go
>> above 1.2s, oddly enough.
>> On 2020-04-23 15:38, Julius Volz wrote:
>>
>> Odd. Depending on time window alignment, it can always happen that some
>> spikes appear in one graph and not in another, but such a big difference is
>> strange. Just to make sure, what happens when you bring the resolution on
>> both queries down to 15s (which is your rule evaluation interval) or lower?
>>
>> On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We have been using Prometheus (2.13.1) with one of our larger customer
>>> installations for a while; thus far, it's been working great and we are
>>> very thankful for the nice piece of software that it is. (We are a software
>>> company ourselves, using Prometheus to monitor the health of both our own
>>> application and many other relevant parts of the services involved.)
>>> Because of the data volume of some of our metrics, we have a number of
>>> recording rules set up to make querying this data reasonable from e.g.
>>> Grafana.
>>>
>>> However, today we started seeing some really strange behavior after a
>>> planned restart of one of the Tomcat-based application services we are
>>> monitoring. Some requests *seem* to be peaking at 60s (indicating a problem
>>> in our application backend), but the strange thing is that our recording
>>> rules produce very different results from the same queries run directly in
>>> the Prometheus console.
>>>
>>> Here is how the recording rules have been defined in a
>>> custom_recording_rules.yml file:
>>>
>>> - name: hbx_controller_action_global
>>>   rules:
>>>   - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>>>     expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>   - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>>>     expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>   - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>>>     expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>   - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>     expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>
>>> Querying
>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>> yields an output like this:
>>>
>>> [image]
>>>
>>> However, running the individual query gives a completely different view
>>> of this data. Note how the 60-second peaks are completely gone in this
>>> screenshot:
>>>
>>> [image]
>>>
>>> I don't really know what to make of this. Are we doing something
>>> fundamentally wrong here in how our recording rules are set up, or could
>>> this be a bug in Prometheus (unlikely)? Btw, we have the
>>> evaluation_interval set to 15s globally.
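>>>
>>> For reference, a minimal sketch of how that is wired up in the main
>>> Prometheus config (only the 15s interval and the rule file name are taken
>>> from this thread; everything else is omitted):
>>>
>>> global:
>>>   evaluation_interval: 15s   # how often recording rules are evaluated
>>>
>>> rule_files:
>>>   - custom_recording_rules.yml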
>>>
>>> Thanks in advance.
>>>
>>> Best regards,
>>> Per
>>