Thanks for the heads-up! We've flip-flopped a bit between using 1m and 2m. 1m seems to work reliably enough to be useful in most situations, but I'll probably end up going back to 2m after this discussion.
I don't believe that helps with the reset problem though, right? I retried the queries using 2m instead of 1m and they still exhibit the same problem. Is there any more data I can get you to help debug this? We see it happen multiple times per day, and it's making it difficult to monitor our systems in production.

On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:

> Yup, PromQL thinks there's a small dip in the data. I'm not sure why tho.
> I took your raw values:
>
> 225201
> 225226
> 225249
> 225262
> 225278
> 225310
> 225329
> 225363
> 225402
> 225437
> 225466
> 225492
> 225529
> 225555
> 225595
>
> $ awk '{print $1-225201}' values
> 0
> 25
> 48
> 61
> 77
> 109
> 128
> 162
> 201
> 236
> 265
> 291
> 328
> 354
> 394
>
> I'm not seeing the reset there.
>
> One thing I noticed: your data interval is 60 seconds and you are doing a
> rate(counter[1m]). This is not going to work reliably, because you are
> likely not to have two samples in the same step window. Prometheus uses
> millisecond timestamps, so you might have samples at these times:
>
> 5.335
> 65.335
> 125.335
>
> Then if you do a rate(counter[1m]) at time 120 (Grafana attempts to align
> queries to even minutes for consistency), the only sample you'll get back
> is the one at 65.335.
>
> You need to do rate(counter[2m]) in order to avoid problems.
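To make the step-window issue above concrete, here's a quick sketch. It assumes the `(t - range, t]` selection semantics of recent Prometheus (older versions used a closed interval, which makes no difference for these numbers); the helper name is mine, not a real API:

```python
# Which samples does a PromQL range selector see at each evaluation step?
# Recent Prometheus selects samples with timestamp in (t - range, t].
def samples_in_window(sample_ts, eval_ts, window_s):
    return [ts for ts in sample_ts if eval_ts - window_s < ts <= eval_ts]

# 60s scrape interval with a sub-second offset, as described above.
scrapes = [5.335, 65.335, 125.335]

for step in (60, 120, 180):
    # Each aligned 1m window catches at most one sample, so rate()
    # has no pair of samples to compute a delta from.
    print(step, samples_in_window(scrapes, step, 60))

# A 2m window always catches at least two samples.
print(120, samples_in_window(scrapes, 120, 120))
```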
>
> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>
>> I just learned about the resets() function and applying it does seem to
>> show that a reset occurred:
>>
>> {
>>   "request": {
>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>     "method": "GET",
>>     "hideFromInspector": false
>>   },
>>   "response": {
>>     "status": "success",
>>     "data": {
>>       "resultType": "matrix",
>>       "result": [
>>         {
>>           "metric": {/* redacted */},
>>           "values": [
>>             [1649239200, "0"],
>>             [1649239260, "0"],
>>             [1649239320, "0"],
>>             [1649239380, "0"],
>>             [1649239440, "0"],
>>             [1649239500, "0"],
>>             [1649239560, "0"],
>>             [1649239620, "0"],
>>             [1649239680, "0"],
>>             [1649239740, "1"],
>>             [1649239800, "0"],
>>             [1649239860, "0"],
>>             [1649239920, "0"],
>>             [1649239980, "0"],
>>             [1649240040, "0"],
>>             [1649240100, "0"]
>>           ]
>>         }
>>       ]
>>     }
>>   }
>> }
>>
>> I don't quite understand how, though.
>>
>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>
>>> Hi there,
>>>
>>> We're seeing really large spikes when using the `rate()` function on
>>> some of our metrics. I've been able to isolate a single time series
>>> that displays this problem, which I'm going to call `counter`. I
>>> haven't attached the actual metric labels here, but all of the data you
>>> see here is from `counter` over the same time period.
>>>
>>> This is the raw data, as obtained through a request to /api/v1/query:
>>>
>>> {
>>>   "data": {
>>>     "result": [
>>>       {
>>>         "metric": {/* redacted */},
>>>         "values": [
>>>           [1649239253.4, "225201"],
>>>           [1649239313.4, "225226"],
>>>           [1649239373.4, "225249"],
>>>           [1649239433.4, "225262"],
>>>           [1649239493.4, "225278"],
>>>           [1649239553.4, "225310"],
>>>           [1649239613.4, "225329"],
>>>           [1649239673.4, "225363"],
>>>           [1649239733.4, "225402"],
>>>           [1649239793.4, "225437"],
>>>           [1649239853.4, "225466"],
>>>           [1649239913.4, "225492"],
>>>           [1649239973.4, "225529"],
>>>           [1649240033.4, "225555"],
>>>           [1649240093.4, "225595"]
>>>         ]
>>>       }
>>>     ],
>>>     "resultType": "matrix"
>>>   },
>>>   "status": "success"
>>> }
>>>
>>> The next query is taken from the Grafana query inspector, because for
>>> reasons I don't understand I can't get Prometheus to give me any data
>>> when I issue the same query to /api/v1/query_range.
>>> The query is the same as the above query, but wrapped in a rate([1m]):
>>>
>>> "request": {
>>>   "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>   "method": "GET",
>>>   "hideFromInspector": false
>>> },
>>> "response": {
>>>   "status": "success",
>>>   "data": {
>>>     "resultType": "matrix",
>>>     "result": [
>>>       {
>>>         "metric": {/* redacted */},
>>>         "values": [
>>>           [1649239200, "0"],
>>>           [1649239260, "0"],
>>>           [1649239320, "0"],
>>>           [1649239380, "0"],
>>>           [1649239440, "0"],
>>>           [1649239500, "0"],
>>>           [1649239560, "0"],
>>>           [1649239620, "0"],
>>>           [1649239680, "0"],
>>>           [1649239740, "9391.766666666665"],
>>>           [1649239800, "0"],
>>>           [1649239860, "0"],
>>>           [1649239920, "0"],
>>>           [1649239980, "0"],
>>>           [1649240040, "0.03333333333333333"],
>>>           [1649240100, "0"]
>>>         ]
>>>       }
>>>     ]
>>>   }
>>> }
>>>
>>> Given the gradual increase in the underlying counter, I have two
>>> questions:
>>>
>>> 1. How come the rate is 0 for all except 2 datapoints?
>>> 2. How come there is one enormous datapoint in the rate query, that is
>>>    seemingly unexplained in the raw data?
>>>
>>> For 2 I've seen in other threads that the explanation is an
>>> unintentional counter reset, caused by scrapes a millisecond apart that
>>> make the counter appear to go down for a single scrape interval. I
>>> don't think I see this in our raw data, though.
>>>
>>> We're using Prometheus version 2.26.0, revision
>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Prometheus Users" group. To unsubscribe from this group and stop
>> receiving emails from it, send an email to
>> [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/888affc4-e9ba-4ea8-8a40-c7b7a17affe4n%40googlegroups.com.
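A postscript for anyone who finds this thread later: the enormous datapoint in the rate() output above is what PromQL's counter-reset correction produces. Whenever adjacent samples inside the window decrease, rate() assumes the counter reset to zero and adds the pre-reset value back, so even a 1-count dip (e.g. from two scrapes landing a millisecond apart) inflates the result by roughly the entire counter value. A minimal sketch of that logic, ignoring Prometheus's extrapolation to the window boundaries (function names are mine, not the real implementation):

```python
def resets(values):
    # resets() counts samples that are lower than their predecessor.
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)

def simple_rate(samples):
    # samples: list of (timestamp_seconds, value) inside one window.
    # Any decrease is assumed to be a counter reset, so the pre-reset
    # value is added back before taking the per-second delta.
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    correction, prev = 0.0, v0
    for _, v in samples[1:]:
        if v < prev:
            correction += prev
        prev = v
    return (vn + correction - v0) / (tn - t0)

# The raw data in the thread is strictly increasing, so no reset shows:
raw = [225201, 225226, 225249, 225262, 225278, 225310, 225329,
       225363, 225402, 225437, 225466, 225492, 225529, 225555, 225595]
assert resets(raw) == 0

# But if a hidden sample dips by even 1, the whole counter value is
# added back, producing a huge per-second spike:
window = [(0.0, 225402), (30.0, 225401)]
print(simple_rate(window))  # ~7513/s from a dip of just 1
```

This is why the advice in the thread is to hunt for a near-duplicate scrape inside the spiking window rather than in the 60s-resolution raw data, where such a sample can be invisible.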

