BTW, I just did a quick test.  When setting my graph display range to 2w in 
the Prometheus web interface, I found that adjacent data points were just 
under 81 minutes apart.  So the query

    max_over_time(ALERTS[81m])

was able to show lots of short-lived alerts, which the plain query

    ALERTS

did not.  Setting it longer, e.g. to [3h], smears those alerts over 
multiple graph points, as expected.

On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:

> Presumably you are using the PromQL query browser built into Prometheus? 
> (Not some third-party tool like Grafana etc.?)
>
> When you draw a graph from time T1 to T2, you send the Prometheus API a range 
> query 
> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries> 
> to repeatedly evaluate an instant vector query over a time range from T1 to 
> T2 with some step S.  The step S is chosen by the client so that a suitable 
> number of points fit in the display, e.g. if it wants 200 data points then 
> it could choose step = (T2 - T1) / 200.  In the Prometheus graph view you 
> can see this by moving your mouse left and right over the graph; a pop-up 
> shows you each data point, and you can see it switch from point to point as 
> you move left to right.
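>
> For instance (the timestamps and point count here are just illustrative), a 
> two-week graph drawn with about 250 points might turn into a range query 
> shaped roughly like this:
>
> GET /api/v1/query_range?query=ALERTS&start=1659312000&end=1660521600&step=4838s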
>
> Therefore, it's showing the values of the timeseries at the instants T1, 
> T1+S, T1+2S, ..., T2-S, T2.
>
> When evaluating a timeseries at a given instant in time, it finds the 
> closest value *at or before* that time (up to a maximum lookback interval, 
> which by default is 5 minutes).
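>
> (If I remember rightly, that lookback window corresponds to the 
> --query.lookback-delta flag, which defaults to 5m.)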
>
> Therefore, your graph is showing *samples* of the data in the TSDB.  If 
> you zoom out too far, you may be missing "interesting" values.  For example:
>
> TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
> Graph:       0         0         1         0         0 ...
>
> Counters make this less of a problem: you can get your graph to show how 
> the counter has *increased* between two adjacent points (usually then 
> divided by the step time, to get a rate).
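>
> For example, with a hypothetical counter metric, a graph of
>
> rate(some_requests_total[5m])
>
> shows the per-second increase over the 5-minute window ending at each graph 
> point (you'd normally pick the window to be at least as large as the step).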
>
> However, the problem with a metric like ALERTS is that it's not a counter, 
> and it doesn't even switch between 0 and 1: the whole timeseries appears 
> and disappears.  (In fact, it creates separate timeseries for when the 
> alert is in state "pending" and "firing".)  If your graph step is more than 
> 5 minutes, you may not catch the alert's presence at all.
>
> What you could try is a query like this:
>
> max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>
> The inner query is a range vector: it returns all data points within a 
> 1-hour window, from 1 hour before the evaluation time up to the evaluation 
> time.  Then if *any* data points exist in that window, the highest one is 
> returned, forming an instant vector again.  When your graph sweeps this 
> expression over a time period from T1 to T2, each data point will 
> cover one hour.  That should catch the "missing" samples.
>
> Of course, the time window is fixed to 1h in that query, and you may need 
> to adjust it depending on your graph zoom level, to match the time period 
> between adjacent points on the graph.  If you're using Grafana, there's a 
> magic variable, $__interval 
> <https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__interval>, 
> that you can use.  I vaguely remember seeing a proposal for PromQL to have 
> a way of referring to "the current step interval" in a range vector 
> expression, but I don't know what happened to that.
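>
> For example, in a Grafana panel the earlier query could be written as
>
> max_over_time(ALERTS{alertname="CPUUtilization"}[$__interval])
>
> so that the window tracks the panel's step automatically.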
>
> HTH,
>
> Brian.
>
> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 [email protected] wrote:
>
>> I am currently looking for all CPU alerts using a query of 
>> ALERTS{alertname="CPUUtilization"}
>>
>> I am stepping through the graph time frame one click at a time.  
>>
>> At the 12h time, I get one entry.  At 1d I get zero entries.  At 2d, I 
>> get 4 entries but not the one I found at 12h.  I would expect to get 
>> everything from 2d to now.
>>
>> At 1w, I get 8 entries but at 2w, I only get 5 entries.  I would expect 
>> to get everything from 2w to now.
>>
>> Last week I ran this same query and found the alert I was looking for 
>> back in April.  Today I ran the same query and I cannot find that alert 
>> from April.
>>
>> I see this behavior in multiple Prometheus environments.
>>
>> Is this a problem or the way the graphing works in Prometheus?
>>
>> Prometheus version is 2.29.1
>> Prometheus retention period is 1y
>> DB is currently 1.2TB.  There are DBs as large as 5TB in other Prometheus 
>> environments.