Thank you, Brian. This helps.

Kevin
On Thursday, August 18, 2022 at 4:27:01 AM UTC-5 Brian Candler wrote:

> BTW, I just did a quick test. When setting my graph display range to 2w
> in the Prometheus web interface, I found that adjacent data points were
> just under 81 minutes apart. So the query
>
>     max_over_time(ALERTS[81m])
>
> was able to show lots of short-lived alerts, which the plain query
>
>     ALERTS
>
> did not. Setting it longer, e.g. to [3h], smears those alerts over
> multiple graph points, as expected.
>
> On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:
>
>> Presumably you are using the PromQL query browser built into Prometheus?
>> (Not some third-party tool like Grafana etc.?)
>>
>> When you draw a graph from time T1 to T2, you send the Prometheus API a
>> range query
>> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries>
>> to repeatedly evaluate an instant vector query over a time range from T1
>> to T2 with some step S. The step S is chosen by the client so that a
>> suitable number of points fit in the display, e.g. if it wants 200 data
>> points then it could choose step = (T2 - T1) / 200. In the Prometheus
>> graph view you can see this by moving your mouse left and right over the
>> graph; a pop-up shows you each data point, and you can see it switch from
>> point to point as you move left to right.
>>
>> Therefore, it's showing the values of the timeseries at the instants
>> T1, T1+S, T1+2S, ... T2-S, T2.
>>
>> When evaluating a timeseries at a given instant in time, it finds the
>> closest value *at or before* that time (up to a maximum lookback
>> interval, which by default is 5 minutes).
>>
>> Therefore, your graph is showing *samples* of the data in the TSDB. If
>> you zoom out too far, you may be missing "interesting" values. For example:
>>
>>     TSDB : 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
>>     Graph: 0 0 1 0 0 ...
>>
>> Counters make this less of a problem: you can get your graph to show how
>> the counter has *increased* between two adjacent points (usually then
>> divided by the step time, to get a rate).
>>
>> However, the problem for a metric like ALERTS is that it's not a counter,
>> and it doesn't even switch between 0 and 1; instead, the whole timeseries
>> appears and disappears. (In fact, it creates separate timeseries for when
>> the alert is in state "pending" and "firing".) If your graph step is more
>> than 5 minutes, you may not catch the alert's presence at all.
>>
>> What you could try is a query like this:
>>
>>     max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>>
>> The inner query is a range vector: it returns all data points within a
>> 1-hour window, from 1 hour before the evaluation time up to the
>> evaluation time. Then, if *any* data points exist in that window, the
>> highest one is returned, forming an instant vector again. When your graph
>> sweeps this expression over a time period from T1 to T2, each data point
>> will cover one hour. That should catch the "missing" samples.
>>
>> Of course, the time window is fixed to 1h in that query, and you may need
>> to adjust it depending on your graph zoom level, to match the time period
>> between adjacent points on the graph. If you're using Grafana, there's a
>> magic variable, $__interval
>> <https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__interval>,
>> you can use.
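For anyone who finds this thread later, here is a rough sketch of the range query Brian describes above, just to make the sampling behaviour concrete. The server URL, graph range, and point count are illustrative assumptions only (a local Prometheus at localhost:9090, a 2w range, roughly 200 graph points), not anything taken from Brian's or my setup.

    # Rough sketch only -- assumes a Prometheus server at http://localhost:9090,
    # a 2-week graph range, and roughly 200 points on the graph.
    import time
    import requests

    PROM = "http://localhost:9090"   # assumption: local Prometheus
    query = 'ALERTS{alertname="CPUUtilization"}'

    end = time.time()
    start = end - 14 * 24 * 3600     # 2w graph range
    step = (end - start) / 200       # ~200 data points -> ~100 minutes apart

    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()

    # The graph only "sees" each series at start, start+step, ..., end.
    # At each of those instants Prometheus looks back at most 5 minutes
    # (the default lookback delta) for a sample, so an alert that was only
    # active between two evaluation instants never appears in this result.
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("alertstate"), len(series["values"]), "samples")

Nothing here is specific to ALERTS; any metric graphed over a wide enough range gets sampled the same way.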
>> I vaguely remember seeing a proposal for PromQL to have a way of
>> referring to "the current step interval" in a range vector expression,
>> but I don't know what happened to that.
>>
>> HTH,
>>
>> Brian.
>>
>> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 [email protected] wrote:
>>
>>> I am currently looking for all CPU alerts using a query of
>>>
>>>     ALERTS{alertname="CPUUtilization"}
>>>
>>> I am stepping through the graph time frame one click at a time.
>>>
>>> At the 12h range, I get one entry. At 1d I get zero entries. At 2d, I
>>> get 4 entries, but not the one I found at 12h. I would expect to get
>>> everything from 2d to now.
>>>
>>> At 1w, I get 8 entries, but at 2w, I only get 5 entries. I would expect
>>> to get everything from 2w to now.
>>>
>>> Last week I ran this same query and found the alert I was looking for
>>> back in April. Today I ran the same query and I cannot find that alert
>>> from April.
>>>
>>> I see this behavior in multiple Prometheus environments.
>>>
>>> Is this a problem or the way the graphing works in Prometheus?
>>>
>>> Prometheus version is 2.29.1
>>> Prometheus retention period is 1y
>>> DB is currently 1.2TB. There are DBs as large as 5TB in other
>>> Prometheus environments.
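And a sketch of Brian's max_over_time workaround with the range-vector window matched to the graph step, which is what made the short-lived alerts visible in his 2w / 81m test above. Same illustrative assumptions as the previous sketch (local server, 2w range, ~200 points); in Grafana the window would instead come from the $__interval variable Brian linked.

    # Sketch: match the range-vector window to the graph step so every stored
    # sample falls inside some window. Same assumptions as the sketch above.
    import time
    import requests

    PROM = "http://localhost:9090"   # assumption: local Prometheus

    end = time.time()
    start = end - 14 * 24 * 3600
    step = (end - start) / 200

    # Round the window up to whole seconds so it is never smaller than the step.
    window = f"{int(step) + 1}s"
    query = f'max_over_time(ALERTS{{alertname="CPUUtilization"}}[{window}])'

    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("alertstate"), len(series["values"]), "samples")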

