Interesting, Marcin. It could indeed be the same root cause, yes.
Julius - I tried with a more recent Prometheus version, 2.18.1 (after
synchronizing 200 GiB of data to my local machine ;) and I still get the
same behavior.
I get the feeling that this could be a bug in Prometheus. Should I perhaps
report it via the GitHub issue tracker?
Best regards,
Per
On Wednesday, May 13, 2020 at 12:09:43 PM UTC+3, Marcin Chmiel wrote:
>
> We're facing what I believe is the exact same issue, on v2.16.0, although
> we also have some intermittent failures with kube-state-metrics, which
> generates the data for this query. I would expect the recording rule to
> either be empty (due to time skew) or show the same value as the query,
> but not to show a dip like this.
>
> Here's the query that's plotted, and on which the recording rule is based:
>
> count by (namespace) (kube_namespace_labels{label_xxx="123"})
>
> The absent series shows periods where kube-state-metrics is unavailable;
> orange is where the query and the recording rule overlap.
>
> [image: grafana.png]
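>
> (As a rough cross-check, one way to confirm that the dips line up with
> kube-state-metrics being unavailable would be to graph the gaps explicitly
> -- roughly something like, reusing the same selector as above:
>
> absent(kube_namespace_labels{label_xxx="123"})
>
> which only returns 1 while no matching series exists at all, so overlap
> with the dips would point at stale or missing source data rather than at
> the rule evaluation itself.)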
>
>
> On Thursday, 23 April 2020 15:24:42 UTC+2, Julius Volz wrote:
>>
>> Strange! Have you tried a more recent Prometheus version, btw? Just to
>> rule that part out, since 2.13.1 is pretty old...
>>
>> On Thu, Apr 23, 2020 at 3:02 PM Per Lundberg <[email protected]> wrote:
>>
>>> With global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m,
>>> more 60s spikes show up if I change to a 15s or 5s interval. With the
>>> other query (histogram_quantile(0.99, sum by
>>> (le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't
>>> go above 1.2s, oddly enough.
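>>>
>>> (For what it's worth, plotting the difference between the two directly
>>> might also be telling -- roughly something like:
>>>
>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>   - histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>
>>> which I'd expect to hover around zero whenever the recorded samples and
>>> the ad-hoc evaluation actually agree.)
>>>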
>>> On 2020-04-23 15:38, Julius Volz wrote:
>>>
>>> Odd. Depending on time window alignment, it is always possible for some
>>> spikes to appear in one graph and not the other, but such a big difference
>>> is strange. Just to make sure, what happens when you bring the resolution
>>> on both queries down to 15s (which is your rule evaluation interval) or
>>> lower?
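>>>
>>> One way to pin the step down exactly, instead of relying on the graph UI,
>>> is to hit the range-query API directly -- assuming Prometheus is reachable
>>> on localhost:9090, and with placeholder start/end times, something along
>>> the lines of:
>>>
>>> curl -G 'http://localhost:9090/api/v1/query_range' \
>>>   --data-urlencode 'query=global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m' \
>>>   --data-urlencode 'start=2020-04-23T00:00:00Z' \
>>>   --data-urlencode 'end=2020-04-23T12:00:00Z' \
>>>   --data-urlencode 'step=15s'
>>>
>>> and then the same with the raw histogram_quantile(...) expression, so that
>>> both are sampled at exactly the same 15s-spaced timestamps.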
>>>
>>> On Thu, Apr 23, 2020 at 12:59 PM Per Lundberg <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been using Prometheus (2.13.1) with one of our larger customer
>>>> installations for a while; thus far it has been working great, and we are
>>>> very thankful for the nice piece of software that it is. (We are a
>>>> software company ourselves, using Prometheus to monitor the health of our
>>>> own application as well as many other relevant parts of the services
>>>> involved.)
>>>>
>>>> Because of the data volume for some of our metrics, we have a number of
>>>> recording rules set up to make querying this data reasonable from e.g.
>>>> Grafana.
>>>>
>>>> However, today we started seeing some really strange behavior after a
>>>> planned restart of one of the Tomcat-based application services we are
>>>> monitoring. Some requests *seem* to be peaking at 60s (indicating a
>>>> problem in our application backend), but the strange thing is that our
>>>> recording rules produce very different results than running the same
>>>> queries directly in the Prometheus console.
>>>>
>>>> Here is how the recording rules have been defined in a
>>>> custom_recording_rules.yml file:
>>>>
>>>> - name: hbx_controller_action_global
>>>>   rules:
>>>>     - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
>>>>       expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>     - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
>>>>       expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>     - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
>>>>       expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>     - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>>       expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
>>>>
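>>>> (For completeness, the file can also be run through promtool -- assuming
>>>> the promtool that ships with the Prometheus distribution is available:
>>>>
>>>> promtool check rules custom_recording_rules.yml
>>>>
>>>> which should at least rule out syntax or structure problems in the rule
>>>> definitions.)
>>>>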
>>>> Querying
>>>> global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
>>>> yields an output like this:
>>>>
>>>> [image: graph of the recording rule, showing peaks at 60 seconds]
>>>>
>>>> However, running the individual query gives a completely different view
>>>> of this data. Note how the 60-second peaks are completely gone in this
>>>> screenshot:
>>>>
>>>> [image: graph of the raw query, without the 60-second peaks]
>>>>
>>>> I don't really know what to make of this. Are we doing something
>>>> fundamentally wrong in how our recording rules are set up, or could this
>>>> be a bug in Prometheus (unlikely)? Btw, we have the evaluation_interval
>>>> set to 15s globally.
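>>>>
>>>> For reference, the relevant part of a prometheus.yml configured this way
>>>> would look roughly like the following (the 15s evaluation_interval is the
>>>> setting that matters here; the scrape_interval line is just a typical
>>>> value, shown for context):
>>>>
>>>> global:
>>>>   scrape_interval: 15s      # typical value, shown only for context
>>>>   evaluation_interval: 15s  # how often the recording rules above are evaluated
>>>>
>>>> rule_files:
>>>>   - custom_recording_rules.yml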
>>>>
>>>> Thanks in advance.
>>>>
>>>> Best regards,
>>>> Per
>>>