Thank you Brian.  This helps.

Kevin

On Thursday, August 18, 2022 at 4:27:01 AM UTC-5 Brian Candler wrote:

> BTW, I just did a quick test.  When setting my graph display range to 2w 
> in the Prometheus web interface, I found that adjacent data points were 
> just under 81 minutes apart.  So the query
>
>     max_over_time(ALERTS[81m])
>
> was able to show lots of short-lived alerts, which the plain query
>
>     ALERTS
>
> did not.  Setting it longer, e.g. to [3h], smears those alerts over 
> multiple graph points, as expected.
>
> On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:
>
>> Presumably you are using the PromQL query browser built into prometheus? 
>> (Not some third party tool like Grafana etc?)
>>
>> When you draw a graph from time T1 to T2, you send the prometheus API a 
>> range query 
>> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries> 
>> to repeatedly evaluate an instant vector query over a time range from T1 to 
>> T2 with some step S.  The step S is chosen by the client so that a 
>> suitable number of points fit in the display, e.g. if it wants 200 data 
>> points then it could choose step = (T2 - T1) / 200.  In the prometheus 
>> graph view you 
>> can see this by moving your mouse left and right over the graph; a pop-up 
>> shows you each data point, and you can see it switch from point to point as 
>> you move left to right.
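>>
>> For example, a direct range-query call to the API looks roughly like this 
>> (host and timestamps are just placeholders):
>>
>>     curl 'http://localhost:9090/api/v1/query_range' \
>>       --data-urlencode 'query=ALERTS{alertname="CPUUtilization"}' \
>>       --data-urlencode 'start=2022-08-01T00:00:00Z' \
>>       --data-urlencode 'end=2022-08-15T00:00:00Z' \
>>       --data-urlencode 'step=81m'
>>
>> The web UI does the same thing behind the scenes, just with the step 
>> computed from the zoom level.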
>>
>> Therefore, it's showing the values of the timeseries at the instants T1, 
>> T1+S, T1+2S, ... T2-S,T2.
>>
>> When evaluating a timeseries at a given instant in time, it finds the 
>> closest value *at or before* that time (up to a maximum lookback interval, 
>> which by default is 5 minutes).
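>>
>> (That lookback window is a server-side setting; if you ever needed to 
>> change it, it's the startup flag
>>
>>     --query.lookback-delta=5m
>>
>> though the 5-minute default is fine for most setups.)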
>>
>> Therefore, your graph is showing *samples* of the data in the TSDB.  If 
>> you zoom out too far, you may be missing "interesting" values.  For example:
>>
>> TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
>> Graph:       0         0         1         0         0 ...
>>
>> Counters make this less of a problem: you can get your graph to show how 
>> the counter has *increased* between two adjacent points (usually then 
>> divided by the step time, to get a rate).
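>>
>> For example, with a counter you'd typically graph something like this 
>> (metric name is just illustrative):
>>
>>     rate(node_network_receive_bytes_total[1h])
>>
>> and as long as the window roughly matches the gap between graph points, no 
>> increase is lost even though the graph only samples sparsely.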
>>
>> However, the problem with a metric like ALERTS is that it's not a 
>> counter, and it doesn't even switch between 0 and 1: the whole timeseries 
>> appears and disappears.  (In fact, it creates separate timeseries for when 
>> the alert is in state "pending" and "firing".)  If your graph step is more 
>> than 5 minutes, you may not catch the alert's presence at all.
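>>
>> (The state is exposed as the alertstate label, so e.g.
>>
>>     ALERTS{alertname="CPUUtilization", alertstate="firing"}
>>
>> would restrict the query to the "firing" series only.)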
>>
>> What you could try is a query like this:
>>
>> max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>>
>> The inner query is a range vector: it returns all data points within a 1 
>> hour window, between 1 hour before the evaluation time up to the evaluation 
>> time.  Then if *any* data points exist in that window, the highest one is 
>> returned, forming an instant vector again.  When your graph sweeps this 
>> expression over a time period from T1 to T2, then each data point will 
>> cover one hour. That should catch the "missing" samples.
>>
>> Of course, the time window is fixed to 1h in that query, and you may need 
>> to adjust it depending on your graph zoom level, to match the time period 
>> between adjacent points on the graph.  If you're using Grafana, there's a 
>> magic variable 
>> <https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__interval>, 
>> $__interval, that you can use.  I vaguely remember seeing a proposal for PromQL 
>> to have a way of referring to "the current step interval" in a range vector 
>> expression, but I don't know what happened to that.
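>>
>> With Grafana that would look something like this (just a sketch; Grafana 
>> substitutes the interval at query time):
>>
>>     max_over_time(ALERTS{alertname="CPUUtilization"}[$__interval])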
>>
>> HTH,
>>
>> Brian.
>>
>> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 [email protected] wrote:
>>
>>> I am currently looking for all CPU alerts using a query of 
>>> ALERTS{alertname="CPUUtilization"}
>>>
>>> I am stepping through the graph time frame one click at a time.  
>>>
>>> At the 12h time, I get one entry.  At 1d I get zero entries.  At 2d, I 
>>> get 4 entries but not the one I found at 12h.  I would expect to get 
>>> everything from 2d to now.
>>>
>>> At 1w, I get 8 entries but at 2w, I only get 5 entries.  I would expect 
>>> to get everything from 2w to now.
>>>
>>> Last week I ran this same query and found the alert I was looking for 
>>> back in April.  Today I ran the same query and I cannot find that alert 
>>> from April.
>>>
>>> I see this behavior in multiple Prometheus environments.
>>>
>>> Is this a problem or the way the graphing works in Prometheus?
>>>
>>> Prometheus version is 2.29.1
>>> Prometheus retention period is 1y
>>> DB is currently 1.2TB.  There are DBs as large as 5TB in other 
>>> Prometheus environments.
>>>
>>>
>>>
