We're facing what I believe is the exact same issue, on v2.16.0, although we 
also have some intermittent failures with kube-state-metrics, which generates 
the data for this query. I would expect the recording rule to either be empty 
(due to time skew) or to show the same value as the query, but not to dip 
like this.

Here's the query that's plotted, and on which the recording rule is based:

count by (namespace) (kube_namespace_labels{label_xxx="123"})

The "absent" series shows periods where kube-state-metrics is unavailable. 
The orange color marks where the query and the recording rule overlap.

[image: grafana.png]
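
For reference, a recording rule based on this query would look roughly like 
the following in a Prometheus rule file (the group and rule names here are 
illustrative, not the exact ones we use):

  groups:
    - name: kube_namespace_labels_rules
      rules:
        # Pre-computed namespace count for label_xxx="123",
        # evaluated on the configured evaluation_interval.
        - record: namespace:kube_namespace_labels:count
          expr: count by (namespace) (kube_namespace_labels{label_xxx="123"})

If the rule is evaluated normally, querying the recorded series should track 
the raw expression exactly, give or take one evaluation interval.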


On Thursday, 23 April 2020 15:24:42 UTC+2, Julius Volz wrote:
>
> Strange! Have you tried a more recent Prometheus version, btw? Just to 
> rule that part out, since 2.13.1 is pretty old...
>
> On Thu, Apr 23, 2020 at 3:02 PM Per Lundberg <[email protected]> wrote:
>
>> With global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m, 
>> more of the 60s spikes show up if I change the interval to 15s or 5s. With 
>> the other query (histogram_quantile(0.99, sum by 
>> (le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't go 
>> above 1.2s, oddly enough.
>>
>> On 2020-04-23 15:38, Julius Volz wrote:
>>
>> Odd. Depending on time window alignment, it can always happen that some 
>> spikes appear in one graph and not in another, but such a big difference is 
>> strange. Just to make sure: what happens when you bring the resolution of 
>> both queries down to 15s (which is your rule evaluation interval) or lower?
>>
>> On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We have been using Prometheus (2.13.1) with one of our larger customer 
>>> installations for a while; thus far, it has been working great and we are 
>>> very thankful for the nice piece of software that it is. (We are a software 
>>> company ourselves, using Prometheus to monitor the health of both our own 
>>> application and many other relevant parts of the services involved.) 
>>> Because of the data volume of some of our metrics, we have a number of 
>>> recording rules set up to make querying this data reasonable from e.g. 
>>> Grafana.
>>>
>>> However, today we started seeing some really strange behavior after a 
>>> planned restart of one of the Tomcat-based application services we are 
>>> monitoring. Some requests *seem* to be peaking at 60s (which would indicate 
>>> a problem in our application backend), but the strange thing is that our 
>>> recording rules produce very different results from just running the same 
>>> queries in the Prometheus console.
>>>
>>> Here is how the recording rules have been defined in our 
>>> custom_recording_rules.yml file:
>>>
>>>   - name: hbx_controller_action_global
>>>     rules:
>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>>>         expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>>>         expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>>>         expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>         expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>
>>> Querying 
>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m 
>>> yields an output like this:
>>>
>>>
>>> However, running the individual query gives a completely different view 
>>> of this data. Note how the 60-second peaks are completely gone in this 
>>> screenshot:
>>>
>>>
>>> I don't really know what to make of this. Are we doing something 
>>> fundamentally wrong in how our recording rules are set up, or could this 
>>> be a bug in Prometheus (unlikely)? By the way, we have the 
>>> evaluation_interval set to 15s globally.
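>>>
>>> For context, the relevant parts of our prometheus.yml are roughly the 
>>> following (only the settings related to rule evaluation are shown):
>>>
>>>   global:
>>>     evaluation_interval: 15s   # recording rules are evaluated every 15s
>>>
>>>   rule_files:
>>>     - custom_recording_rules.yml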
>>>
>>> Thanks in advance.
>>>
>>> Best regards,
>>> Per
>>
