Re: [prometheus-users] Re: up query

BHARATH KUMAR Wed, 24 Aug 2022 03:43:20 -0700

I found the following in a blog post via Google:

If you want to count the time spent in the down state, this becomes more
complicated because you have to detect the switch from 1 to 0, which counts
for 1 min, and the subsequent down state until the first switch back from 0
to 1.


It could be something along the lines of:

(max_over_time(up[60s]) == bool 0) * ((up offset 61s == bool 1) * count(up[60s]) OR vector(1))

But that query gave me the following error:

bad_data: 1:73: parse error: expected type instant vector in aggregation 
expression, got range vector
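The error comes from count(up[60s]). count() is an aggregation operator: it combines the series of an instant vector at a single evaluation time, so it cannot take the range vector that up[60s] produces. The per-series function for range vectors is count_over_time(). The Python below is only a toy model of that type distinction, not Prometheus code. The dict layouts, the function names and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy model only (not Prometheus internals). A series is identified by its label set.
SeriesKey = Tuple[Tuple[str, str], ...]
InstantVector = Dict[SeriesKey, float]                    # one value per series
RangeVector = Dict[SeriesKey, List[Tuple[int, float]]]    # (timestamp, value) samples per series


def count_aggregation(vec: InstantVector) -> InstantVector:
    """Mimics PromQL count(): counts the series of an instant vector (across series)."""
    if any(not isinstance(v, (int, float)) for v in vec.values()):
        raise TypeError("expected type instant vector in aggregation expression, got range vector")
    return {(): float(len(vec))}


def count_over_time(rng: RangeVector) -> InstantVector:
    """Mimics PromQL count_over_time(): counts the samples inside each series' window."""
    return {key: float(len(samples)) for key, samples in rng.items() if samples}


foo = (("instance", "foo"),)
bar = (("instance", "bar"),)

# Made-up data, roughly what up[60s] could hold with a 15s scrape interval.
up_60s: RangeVector = {
    foo: [(0, 1.0), (15, 1.0), (30, 0.0), (45, 0.0)],
    bar: [(5, 0.0), (20, 0.0), (35, 0.0), (50, 0.0)],
}

print(count_over_time(up_60s))                  # per-series sample counts
print(count_aggregation({foo: 1.0, bar: 0.0}))  # {(): 2.0}

try:
    count_aggregation(up_60s)                   # same mistake as count(up[60s])
except TypeError as err:
    print("error:", err)
```

In the blog query, that part would therefore be written as count_over_time(up[60s]). This sketch does not check whether the corrected expression then measures down time the way the blog intends.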


What am I missing here? How can I achieve something like "find the
instances that have been completely down for the last X days"?
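For comparison, the transition-based idea from the blog can be sketched over raw samples in plain Python. In that idea, a switch from 1 to 0 starts a down run and the next switch from 0 to 1 ends it. This shows the logic only and is not a PromQL query. The sample format and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (unix timestamp in seconds, value of up: 0 or 1)


def down_intervals(samples: List[Sample]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for each run of up == 0 in time-ordered samples."""
    intervals = []
    start = None
    for ts, value in samples:
        if value == 0 and start is None:
            start = ts                      # switch 1 -> 0 (or the series starts down)
        elif value == 1 and start is not None:
            intervals.append((start, ts))   # first switch back 0 -> 1 ends the run
            start = None
    if start is not None:
        intervals.append((start, samples[-1][0]))  # still down at the last sample
    return intervals


def total_down_seconds(samples: List[Sample]) -> int:
    """Sum the lengths of all down runs."""
    return sum(end - start for start, end in down_intervals(samples))


samples = [(0, 1), (60, 0), (120, 0), (180, 1), (240, 0), (300, 0)]
print(down_intervals(samples))       # [(60, 180), (240, 300)]
print(total_down_seconds(samples))   # 180
```

Roughly speaking, a server is "completely down for the last X days" when one such run covers the whole window. That is the simpler max_over_time(up[Xd]) == 0 check discussed further down the thread.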

Thanks & regards,

Bharath Kumar.

On Wednesday, 17 August 2022 at 19:42:26 UTC+5:30 Brian Candler wrote:

> If you want servers that have been down for 30 days, then I thought it 
> should be obvious you need max_over_time(up[30d]) == 0  ... but perhaps it 
> isn't as obvious as I thought.
>
> Let me break that query down into parts:
>
> up[30d]   :   returns a *range vector* containing all data points for the 
> timeseries with metric name "up" from T - 30 days to T (where T is the 
> evaluation time, i.e. the point on the X axis)
>
> By "timeseries" I mean a distinct combination of metric name and labels, e.g.
> up{instance="foo"}
> up{instance="bar"}
> are two different timeseries.  They happen to share the same metric name 
> ("up") but they are recording an independent sequence of measurements.
>
> Think of the range vector as a two-dimensional grid: there are N different 
> timeseries, each with M data points over that period. The data collected 
> and stored in the TSDB might look like this:
>
> up{instance="foo"}  v1 . . . v2 . . . v3 . . .
> up{instance="bar"}  . . v4 . . . v5 . . . v6 .
>                     -------------------------> time
>
> Then:
> max_over_time(...)  :  for each timeseries in the range vector, picks the 
> highest value.  This returns an *instant vector*, i.e. a single value for 
> every timeseries, which is the maximum of each.
>
> up{instance="foo"}  v3
> up{instance="bar"}  v5
>
> Each of those values is the maximum value of the timeseries, over the 30 
> day period.
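The chain of steps above runs from range vector to max_over_time and then to == 0. It can be made concrete with a toy Python model (not Prometheus code). The series names, timestamps and 5-minute window are made up. The window boundaries are simplified to an inclusive [T - window, T]. A PromQL comparison such as == 0 without bool acts as a filter: series for which it is false are dropped from the result.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def select_range(tsdb: Dict[str, List[Sample]], t: int, window: int) -> Dict[str, List[Sample]]:
    """Toy up[window] at evaluation time t: each series' samples in [t - window, t]."""
    return {name: [(ts, v) for ts, v in samples if t - window <= ts <= t]
            for name, samples in tsdb.items()}


def max_over_time(rng: Dict[str, List[Sample]]) -> Dict[str, float]:
    """One value per series: the highest sample in its window (empty windows are dropped)."""
    return {name: max(v for _, v in samples) for name, samples in rng.items() if samples}


def eq_filter(vec: Dict[str, float], target: float) -> Dict[str, float]:
    """Like PromQL `== target` without `bool`: keep only series where the comparison holds."""
    return {name: v for name, v in vec.items() if v == target}


tsdb = {
    'up{instance="foo"}': [(0, 0.0), (60, 0.0), (120, 0.0), (180, 0.0), (240, 0.0), (300, 0.0)],  # down throughout
    'up{instance="bar"}': [(0, 1.0), (60, 1.0), (120, 0.0), (180, 1.0), (240, 1.0), (300, 1.0)],  # one failed scrape
}

result = eq_filter(max_over_time(select_range(tsdb, t=300, window=300)), 0.0)
print(result)  # {'up{instance="foo"}': 0.0}
```

The series with a single failed scrape has a maximum of 1 and is filtered out. Only the series that was 0 at every sample in the window survives.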
>
> Now, you've chosen to draw a graph of this expression, but it's important 
> to realise that the graph itself doesn't need to be over 30 days.  When you 
> draw a graph of an expression, it will sweep across the evaluation time, 
> evaluating the expression repeatedly at different instants in time over the 
> given period.
>
> Let's say, for example, you set the graph range to be 1 week, but you are 
> graphing max_over_time(up[30d]) == 0
>
> What will you get?  This will be a series of points.  Let's imagine the 
> graph only had one point per day. Considering the position of each point on 
> the time axis:
> Aug 17: shows whether the server was down for the whole of (Aug 17 - 30 days) to (Aug 17)
> Aug 16: shows whether the server was down for the whole of (Aug 16 - 30 days) to (Aug 16)
> ...
> Aug 10: shows whether the server was down for the whole of (Aug 10 - 30 days) to (Aug 10)
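The sweep can be modelled as re-evaluating the same instant expression once per step. The sketch below is a toy with one sample per day, a 3-day window instead of 30 days, and made-up data. The function names are hypothetical. It also shows that a single instant evaluation at the latest time already answers "has it been down for the whole window up to now?".

```python
from typing import Dict, List, Tuple

DAY = 86_400  # seconds
Sample = Tuple[int, float]


def down_whole_window(tsdb: Dict[str, List[Sample]], t: int, window: int) -> Dict[str, float]:
    """Toy instant evaluation of max_over_time(up[window]) == 0 at time t."""
    result = {}
    for name, samples in tsdb.items():
        values = [v for ts, v in samples if t - window <= ts <= t]
        if values and max(values) == 0:
            result[name] = 0.0
    return result


def range_query(tsdb: Dict[str, List[Sample]], start: int, end: int,
                step: int, window: int) -> Dict[int, Dict[str, float]]:
    """Toy range query: evaluate the same expression at every step from start to end."""
    return {t: down_whole_window(tsdb, t, window) for t in range(start, end + 1, step)}


# Server "foo" is up until day 5, then down from day 5 onwards (one sample per day).
tsdb = {"foo": [(d * DAY, 1.0 if d < 5 else 0.0) for d in range(11)]}

graph = range_query(tsdb, start=6 * DAY, end=10 * DAY, step=DAY, window=3 * DAY)
for t, points in graph.items():
    print(f"day {t // DAY}: {points}")

# An instant query only evaluates at the latest time.
print(down_whole_window(tsdb, t=10 * DAY, window=3 * DAY))  # {'foo': 0.0}
```

The points for days 6 and 7 are empty because the 3-day window still reaches back to a day when foo was up. From day 8 onwards the whole window is zeros.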
>
> In fact, for your purposes (asking, has the server been down for the *last 
> 30 days*?) you don't need to draw a graph at all!  In which case, if you 
> turn on the "Instant" switch in Grafana it will only ask Prometheus to 
> evaluate the expression for the current instant, which makes the query much 
> faster and cheaper.
>
> This is then an ideal query to use in a dashboard, where you just want to 
> show a list of servers that have been down for the last 30 days.  You don't 
> care, for example, if 2 days ago they were down for the 30 days before that 
> point, do you?  Because that's basically what a graph of that expression
> will tell you: at each point in time, whether it was down for the previous 
> 30 days.
>
> On Wednesday, 17 August 2022 at 14:09:42 UTC+1 [email protected] 
> wrote:
>
>> [image: up.PNG]
>> This is the query I am using. The graph above covers 30 days, and the
>> server has only been down for the last day. I want the servers that have
>> been down for the whole 30 days.
>> On Wednesday, 17 August 2022 at 12:55:48 UTC+5:30 Brian Candler wrote:
>>
>>> Extraordinary claims require extraordinary evidence.
>>>
>>> I don't believe there's a bug in prometheus: I believe there's a bug in 
>>> how you are using it.  But unless you show the data, there's no way to 
>>> demonstrate this.
>>>
>>> On Wednesday, 17 August 2022 at 04:36:43 UTC+1 [email protected] 
>>> wrote:
>>>
>>>>
>>>> Yeah. I only want the servers that have been down for the whole two days:
>>>> the value should be zero (0) throughout the last 'X' days.
>>>>
>>>> But max_over_time is showing me servers that were down for even one minute
>>>> in the last 'X' days.
>>>>
>>>> Thanks & regards,
>>>> Bharath kumar.
>>>> On Tuesday, 16 August 2022 at 20:27:30 UTC+5:30 Stuart Clark wrote:
>>>>
>>>>> On 2022-08-16 15:08, BHARATH KUMAR wrote: 
>>>>> > hello, 
>>>>> > 
>>>>> > max_over_time(up[2d]) == 0 is giving me results where, over the last
>>>>> > two days, a server that went down for even 1 minute is displayed in
>>>>> > the graph, which I don't want. I want only the servers that have been
>>>>> > completely unreachable for the last "X" days.
>>>>> > 
>>>>>
>>>>> So you are only wanting it if every single scrape failed over the past 2 days?
>>>>>
>>>>> Try sum() instead of max_over_time(). 
>>>>>
>>>>> -- 
>>>>> Stuart Clark 
>>>>>
>>>>
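A note on the sum() suggestion. sum() is an aggregation operator, so sum(up[2d]) fails with the same "expected type instant vector" error as count(up[60s]) does. The range-vector function is sum_over_time(). Because up only takes the values 0 and 1, sum_over_time(up[2d]) == 0 and max_over_time(up[2d]) == 0 select the same series. The toy Python check below uses made-up windows and is not Prometheus code. It shows the equivalence and why it relies on the values never being negative.

```python
from typing import Dict, List

# Made-up per-series sample values inside a 2-day window.
windows: Dict[str, List[float]] = {
    "down the whole time": [0.0, 0.0, 0.0, 0.0],
    "down for one scrape": [1.0, 0.0, 1.0, 1.0],
    "up the whole time": [1.0, 1.0, 1.0, 1.0],
}


def all_down_by_max(values: List[float]) -> bool:
    return max(values) == 0      # like max_over_time(...) == 0


def all_down_by_sum(values: List[float]) -> bool:
    return sum(values) == 0      # like sum_over_time(...) == 0


for name, values in windows.items():
    print(name, all_down_by_max(values), all_down_by_sum(values))

# The equivalence needs non-negative values: here the sum is 0 but the max is not.
mixed = [1.0, -1.0]
print(all_down_by_max(mixed), all_down_by_sum(mixed))  # False True
```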

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/47873d21-7f4c-42ac-9753-0651d8f26640n%40googlegroups.com.
