BTW, I just did a quick test. When setting my graph display range to 2w in
the Prometheus web interface, I found that adjacent data points were just
under 81 minutes apart. So the query
max_over_time(ALERTS[81m])
was able to show lots of short-lived alerts, which the plain query
ALERTS
did not. Setting the window longer, e.g. to [3h], smears those alerts over
multiple graph points, as expected.
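For what it's worth, that matches the arithmetic: 2 weeks is 20,160 minutes, and
20,160 / 81 ≈ 249, so the graph appears to be aiming for roughly 250 data points
across the display; presumably the exact step also depends on the width of the
graph panel.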
On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:
> Presumably you are using the PromQL query browser built into Prometheus?
> (Not some third-party tool like Grafana, etc.?)
>
> When you draw a graph from time T1 to T2, you send the Prometheus API a range
> query
> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries>
> to repeatedly evaluate an instant vector query over a time range from T1 to
> T2 with some step S. The step S is chosen by the client so that a
> suitable number of points fit in the display, e.g. if it wants 200 data
> points then it could choose step = (T2 - T1) / 200. In the Prometheus graph
> view you
> can see this by moving your mouse left and right over the graph; a pop-up
> shows you each data point, and you can see it switch from point to point as
> you move left to right.
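>
> For example, drawing 200 points over the last 6 hours would translate into a
> request roughly like this (timestamps are illustrative):
>
> GET /api/v1/query_range?query=ALERTS&start=1660800000&end=1660821600&step=108
>
> where step = 21600 / 200 = 108 seconds.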
>
> Therefore, it's showing the values of the timeseries at the instants T1,
> T1+S, T1+2S, ... T2-S, T2.
>
> When evaluating a timeseries at a given instant in time, Prometheus finds the
> closest value *at or before* that time (up to a maximum lookback interval,
> which by default is 5 minutes).
>
> Therefore, your graph is showing *samples* of the data in the TSDB. If
> you zoom out too far, you may be missing "interesting" values. For example:
>
> TSDB : 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
> Graph: 0 0 1 0 0 ...
>
> Counters make this less of a problem: you can get your graph to show how
> the counter has *increased* between two adjacent points (usually then
> divided by the step time, to get a rate).
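>
> For example (the metric name here is just an illustration), with a 1 hour
> graph step you could plot
>
> increase(some_requests_total[1h])
>
> and you don't lose the increments between the sampled points, because the
> counter accumulates them.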
>
> However, the problem with a metric like ALERTS is that it's not a counter, and
> it doesn't even switch between 0 and 1: the whole timeseries appears
> and disappears. (In fact, it creates separate timeseries for when the
> alert is in state "pending" and "firing".) If your graph step is more than
> 5 minutes, you may not catch the alert's presence at all.
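>
> (You can see the pending and firing series separately with a query like
>
> ALERTS{alertname="CPUUtilization",alertstate="firing"}
>
> since Prometheus adds an "alertstate" label whose value is either "pending" or
> "firing".)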
>
> What you could try is a query like this:
>
> max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>
> The inner query is a range vector: it returns all data points within a 1
> hour window, from 1 hour before the evaluation time up to the evaluation
> time. Then if *any* data points exist in that window, the highest one is
> returned, forming an instant vector again. When your graph sweeps this
> expression over a time period from T1 to T2, each data point will
> cover one hour. That should catch the "missing" samples.
>
> Of course, the time window is fixed at 1h in that query, and you may need
> to adjust it depending on your graph zoom level, to match the time period
> between adjacent points on the graph. If you're using Grafana, there's a
> magic variable, $__interval
> <https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__interval>,
> that you can use for this. I vaguely remember seeing a proposal for PromQL
> to have a way of referring to "the current step interval" in a range vector
> expression, but I don't know what happened to that.
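>
> With that variable, the query above becomes something like
>
> max_over_time(ALERTS{alertname="CPUUtilization"}[$__interval])
>
> and the window then tracks the graph step automatically.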
>
> HTH,
>
> Brian.
>
> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 [email protected] wrote:
>
>> I am currently looking for all CPU alerts using a query of
>> ALERTS{alertname="CPUUtilization"}
>>
>> I am stepping through the graph time frame one click at a time.
>>
>> At the 12h time, I get one entry. At 1d I get zero entries. At 2d, I
>> get 4 entries but not the one I found at 12h. I would expect to get
>> everything from 2d to now.
>>
>> At 1w, I get 8 entries but at 2w, I only get 5 entries. I would expect
>> to get everything from 2w to now.
>>
>> Last week I ran this same query and found the alert I was looking for
>> back in April. Today I ran the same query and I cannot find that alert
>> from April.
>>
>> I see this behavior in multiple Prometheus environments.
>>
>> Is this a problem or the way the graphing works in Prometheus?
>>
>> Prometheus version is 2.29.1
>> Prometheus retention period is 1y
>> DB is currently 1.2TB. There are DBs as large as 5TB in other Prometheus
>> environments.
>>
>>
>>