Interesting, Marcin. Could indeed be the same root cause, yes.

Julius - I tried with a more recent Prometheus version, 2.18.1 (by 
synchronizing 200 GiB of data to my local machine ;) and I still get the 
same behavior.

I get the feeling that this could be a bug in Prometheus. Should I perhaps 
report it via the GitHub issue tracker?

Best regards,
Per

On Wednesday, May 13, 2020 at 12:09:43 PM UTC+3, Marcin Chmiel wrote:
>
> We're facing what I believe is the exact same issue, on v2.16.0, although 
> we also have some intermittent failures with kube-state-metrics, which 
> generates the data for this query. I reckon the recording rule should 
> either be empty (due to time skew) or show the same value, but it should 
> not have such a dip.
>
> Here's the query that's plotted, on which the recording rule is based:
>
> count by (namespace) (kube_namespace_labels{label_xxx="123"})
>
> The absent series shows the periods where kube-state-metrics is 
> unavailable; orange is where the query and the recording rule overlap.
>
> [image: grafana.png]
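>
> (A recording rule over that expression would look roughly like the sketch 
> below; the group and record names here are just placeholders, not our 
> actual rule file:)
>
>   groups:
>     - name: kube_namespace_labels
>       rules:
>         - record: namespace:kube_namespace_labels_xxx:count
>           expr: count by (namespace) (kube_namespace_labels{label_xxx="123"})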
>
>
> On Thursday, 23 April 2020 15:24:42 UTC+2, Julius Volz wrote:
>>
>> Strange! Have you tried a more recent Prometheus version, btw.? Just to 
>> rule that part out, since 2.13.1 is pretty old...
>>
>> On Thu, Apr 23, 2020 at 3:02 PM Per Lundberg <[email protected]> wrote:
>>
>>> With 
>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m, 
>>> more 60s spikes show up if I change to a 15s or 5s interval. With the 
>>> other query, histogram_quantile(0.99, sum by (le) 
>>> (rate(hbx_controller_action_seconds_bucket[1m]))), it still doesn't go 
>>> above 1.2s, oddly enough.
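>>>
>>> (One way to pin down where the two diverge would be to subtract the 
>>> ad-hoc expression from the recorded series and graph the result, roughly 
>>> like this, with both expressions taken verbatim from the rule definition 
>>> quoted further down:)
>>>
>>>   global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>     - histogram_quantile(0.99, sum by (le) (rate(hbx_controller_action_seconds_bucket[1m])))
>>>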
>>> On 2020-04-23 15:38, Julius Volz wrote:
>>>
>>> Odd. Depending on time window alignment, it is always possible that some 
>>> spikes appear in one graph and not the other, but such a big difference 
>>> is strange. Just to make sure, what happens when you bring the 
>>> resolution on both queries down to 15s (which is your rule evaluation 
>>> interval) or lower?
>>>
>>> On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been using Prometheus (2.13.1) with one of our larger customer 
>>>> installations for a while; thus far, it has been working great and we 
>>>> are very thankful for the nice piece of software that it is. (We are a 
>>>> software company ourselves, using Prometheus to monitor the health of 
>>>> both our own application and many other relevant parts of the services 
>>>> involved.) Because of the data volume of some of our metrics, we have a 
>>>> number of recording rules set up to make querying this data reasonable 
>>>> from e.g. Grafana.
>>>>
>>>> However, today we started seeing some really strange behavior after a 
>>>> planned restart of one of the Tomcat-based application services we are 
>>>> monitoring. Some requests *seem* to be peaking at 60s (indicating a 
>>>> problem in our application backend), but the strange thing is that our 
>>>> recording rules produce very different results from just running the 
>>>> same queries in the Prometheus console.
>>>>
>>>> Here is how the recording rule has been defined in a 
>>>> custom_recording_rules.yml file:
>>>>
>>>>   - name: hbx_controller_action_global
>>>>     rules:
>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>>>>         expr: histogram_quantile(0.5, sum by (le) (rate(hbx_controller_action_seconds_bucket[1m])))
>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>>>>         expr: histogram_quantile(0.75, sum by (le) (rate(hbx_controller_action_seconds_bucket[1m])))
>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>>>>         expr: histogram_quantile(0.95, sum by (le) (rate(hbx_controller_action_seconds_bucket[1m])))
>>>>       - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>>         expr: histogram_quantile(0.99, sum by (le) (rate(hbx_controller_action_seconds_bucket[1m])))
>>>>
>>>> Querying 
>>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m 
>>>> yields an output like this:
>>>>
>>>>
>>>> However, running the individual query gives a completely different view 
>>>> of this data. Note how the 60-second peaks are completely gone in this 
>>>> screenshot:
>>>>
>>>>
>>>> I don't really know what to make of this. Are we doing something 
>>>> fundamentally wrong here in how our recording rules are set up, or 
>>>> could this be a bug in Prometheus (unlikely)? Btw, we have the 
>>>> evaluation_interval set to 15s globally.
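>>>>
>>>> (That is, the relevant parts of prometheus.yml look roughly like this; 
>>>> the exact rule file path may differ:)
>>>>
>>>>   global:
>>>>     evaluation_interval: 15s  # applies to all recording rules
>>>>
>>>>   rule_files:
>>>>     - custom_recording_rules.yml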
>>>>
>>>> Thanks in advance.
>>>>
>>>> Best regards,
>>>> Per
>>>
