If you want servers that have been down for the whole 30 days, then I
thought it should be obvious you need max_over_time(up[30d]) == 0 ... but
perhaps it isn't as obvious as I thought.
Let me break that query down into parts:
up[30d] : returns a *range vector* containing all data points for every
timeseries with metric name "up", from T - 30 days to T (where T is the
evaluation time, i.e. the point on the X axis)
By "timeseries" I mean a distinct combination of metric name and labels, e.g.
up{instance="foo"}
up{instance="bar"}
are two different timeseries. They happen to share the same metric name
("up") but each records an independent sequence of measurements.
Think of the range vector as a two-dimensional grid: there are N different
timeseries, each with M data points over that period. The data collected
and stored in the TSDB might look like this:
up{instance="foo"} v1 . . . v2 . . . v3 . . .
up{instance="bar"} . . v4 . . . v5 . . . v6 .
-------------------------> time
Then:
max_over_time(...) : for each timeseries in the range vector, picks the
highest value. This returns an *instant vector*, i.e. a single value for
every timeseries, which is the maximum of each.
up{instance="foo"} v3
up{instance="bar"} v5
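Each of those values is the maximum value of that timeseries over the
30-day period (in this picture, v3 and v5 happen to be the largest). Since
"up" is only ever 1 (scrape succeeded) or 0 (scrape failed), the maximum
can be 0 only if every sample in the window was 0. So the "== 0" filter
keeps exactly the timeseries that were down at every scrape in the window.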
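To make the two steps concrete, here is a small Python sketch of what
max_over_time(up[30d]) == 0 does. It is only a toy model, not how Prometheus
is implemented: the sample data, the dict layout and the function names are
all invented for illustration, and the handling of the window boundary is
simplified.

```python
# Toy model of max_over_time(up[30d]) == 0. Not Prometheus code: the data
# layout, function names and sample values are invented for illustration.

DAY = 24 * 60 * 60  # seconds

# Each timeseries is identified by metric name plus labels and holds a list
# of (timestamp, value) samples. "up" is 1 if the scrape worked, else 0.
tsdb = {
    'up{instance="foo"}': [(2 * DAY, 0), (10 * DAY, 1), (20 * DAY, 0), (29 * DAY, 0)],
    'up{instance="bar"}': [(1 * DAY, 0), (11 * DAY, 0), (21 * DAY, 0), (30 * DAY, 0)],
}


def range_vector(db, t, window):
    """Model of up[window] at evaluation time t: the samples in the window
    ending at t (boundary handling simplified)."""
    return {
        series: [(ts, v) for ts, v in samples if t - window < ts <= t]
        for series, samples in db.items()
    }


def max_over_time(rv):
    """One value per timeseries: the largest sample value in its range."""
    return {
        series: max(v for _, v in samples)
        for series, samples in rv.items()
        if samples  # a series with no samples in the window yields nothing
    }


def equals_zero(iv):
    """Model of '== 0': keep only the series whose value is 0."""
    return {series: v for series, v in iv.items() if v == 0}


t = 30 * DAY
maxima = max_over_time(range_vector(tsdb, t, 30 * DAY))
down_whole_window = equals_zero(maxima)
print(maxima)             # one value per series
print(down_whole_window)  # only the series that never had a 1
```

In this made-up data "foo" was up at one scrape, so its maximum is 1 and it
is filtered out; "bar" never had a successful scrape, so it is the only
series left.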
Now, you've chosen to draw a graph of this expression, but it's important
to realise that the graph itself doesn't need to be over 30 days. When you
draw a graph of an expression, the evaluation time sweeps across the
graph's time range: the expression is evaluated repeatedly, at different
instants in time over the given period.
Let's say, for example, you set the graph range to be 1 week, but you are
graphing max_over_time(up[30d]) == 0
What will you get? This will be a series of points. Let's imagine the
graph only had one point per day. Considering the position of each point on
the time axis:
Aug 17: shows if the server has been down from (Aug 17 - 30 days) to (Aug 17)
Aug 16: shows if the server has been down from (Aug 16 - 30 days) to (Aug 16)
...
Aug 10: shows if the server has been down from (Aug 10 - 30 days) to (Aug 10)
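The sweep can be modelled in a few lines of Python. This is a sketch under
stated assumptions: time is counted in whole days, there is one "up" sample
per day, and the history (a hypothetical server that was up until day 5 and
down ever since) is invented.

```python
# Toy model of graphing max_over_time(up[30d]) == 0 with one point per day.
# Assumptions: time in whole days, one "up" sample per day, invented history.

WINDOW = 30  # days, as in up[30d]

# Hypothetical server: up (1) on days 0..5, down (0) from day 6 onwards.
samples = {day: (1 if day <= 5 else 0) for day in range(0, 38)}


def down_for_whole_window(t):
    """True if every sample in the 30 days up to and including day t is 0."""
    window = [v for day, v in samples.items() if t - WINDOW < day <= t]
    return max(window) == 0


# A one-week graph: the expression is re-evaluated at each point on the
# X axis, and each evaluation looks back over its own 30-day window.
graph = {t: down_for_whole_window(t) for t in range(31, 38)}
for t, is_down in graph.items():
    print(f"day {t}: window is days {t - WINDOW + 1}..{t}, down for whole window: {is_down}")
```

The server has been down since day 6, but a point only appears on the graph
from day 35 onwards, because that is the first evaluation time whose 30-day
window contains nothing but zeros.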
In fact, for your purposes (asking, has the server been down for the *last
30 days*?) you don't need to draw a graph at all! If you turn on the
"Instant" switch in Grafana, it will only ask Prometheus to evaluate the
expression at the current instant, which makes the query much faster and
cheaper.
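For a rough idea of the difference in cost, compare the two kinds of request
in the Prometheus HTTP API: /api/v1/query evaluates the expression once,
while /api/v1/query_range evaluates it at every step between start and end.
The sketch below only builds the URLs and counts evaluations; nothing is
sent. The server address, the timestamp and the 1-hour step are assumptions
picked for illustration, and the step Grafana actually chooses for a graph
will differ.

```python
# Sketch: an instant query versus a range query against the Prometheus
# HTTP API. Only builds URLs; nothing is sent over the network.
# Assumptions: server address, example timestamp and the 1-hour step.
from urllib.parse import urlencode

BASE = "http://localhost:9090"  # assumed address, change to suit
QUERY = "max_over_time(up[30d]) == 0"


def instant_query_url(query, time=None):
    """URL for /api/v1/query: a single evaluation at 'time'
    (the server uses its current time if 'time' is omitted)."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    return f"{BASE}/api/v1/query?{urlencode(params)}"


def range_query_url(query, start, end, step):
    """URL for /api/v1/query_range: evaluated at start, start+step, ... up to end."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{BASE}/api/v1/query_range?{urlencode(params)}"


def evaluations_in_range(start, end, step):
    """Number of evaluation timestamps in a range query (both ends included)."""
    return (end - start) // step + 1


end = 1660694400         # an example Unix timestamp
start = end - 7 * 86400  # one week earlier
step = 3600              # assumed step of one hour

print(instant_query_url(QUERY))
print(range_query_url(QUERY, start, end, step))
print(evaluations_in_range(start, end, step), "evaluations for the 1-week graph, versus 1")
```

Each of those evaluations has to look back over its own 30 days of samples,
which is why the single instant evaluation is so much cheaper.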
This is then an ideal query to use in a dashboard, where you just want to
show a list of servers that have been down for the last 30 days. You don't
care, for example, whether 2 days ago they had been down for the 30 days
before that point, do you? Because that's basically what a graph of that
expression will tell you: at each point in time, whether the server was
down for the whole of the previous 30 days.
On Wednesday, 17 August 2022 at 14:09:42 UTC+1 [email protected] wrote:
> [image: up.PNG]
> this is the query I am using and the above graph is for 30 days and it is
> down from the last day. I want the servers that are down for the whole 30
> days
> On Wednesday, 17 August 2022 at 12:55:48 UTC+5:30 Brian Candler wrote:
>
>> Extraordinary claims require extraordinary evidence.
>>
>> I don't believe there's a bug in prometheus: I believe there's a bug in
>> how you are using it. But unless you show the data, there's no way to
>> demonstrate this.
>>
>> On Wednesday, 17 August 2022 at 04:36:43 UTC+1 [email protected]
>> wrote:
>>
>>>
>>> yeah. I want only that the servers are down for the whole two days. Its
>>> value should always be zero(0) throughout the last 'X' days.
>>>
>>> But max_over_time is giving me the info if the servers are down for even
>>> one minute from the last 'X' days.
>>>
>>> Thanks & regards,
>>> Bharath kumar.
>>> On Tuesday, 16 August 2022 at 20:27:30 UTC+5:30 Stuart Clark wrote:
>>>
>>>> On 2022-08-16 15:08, BHARATH KUMAR wrote:
>>>> > hello,
>>>> >
>>>> > max_over_time(up[2d]) == 0 is giving me the info like ...for the last
>>>> > two days if the server goes down for 1 minute also it was displaying
>>>> > in the graph which I don't want. I want the information that for the
>>>> > last "X" days it should be completely in an unreachable state.
>>>> >
>>>>
>>>> So you are only wanting it if every single scrape failed over the past
>>>> 2
>>>> days?
>>>>
>>>> Try sum() instead of max_over_time().
>>>>
>>>> --
>>>> Stuart Clark
>>>>
>>>