Hey there.

I eventually got back to this and I'm still fighting this problem.

As a reminder, my goal was:
- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
  how Icinga would put a host into the down state after pings have failed
  for a certain number of seconds)
- but even if a single scrape fails (which alone wouldn't trigger the above
  alert) I'd like to get a notification (telling me that something might be
  fishy with the networking or similar), UNLESS that single failed scrape
  is part of a sequence of failed scrapes that also caused / will cause the
  above target-down alert

Assume in the following that each number is a sample of the `up` metric of a
single host, spaced ~10s apart, with the most recent one being the right-most:
- 1 1 1 1 1 1 1 => should give nothing
- 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single failure,
                   or develop into the target-down alert)
- 1 1 1 1 1 0 0 => same as above, not clear yet
...
- 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

In the following:
- 1 1 1 1 1 0 1
- 1 1 1 1 0 0 1
- 1 1 1 0 0 0 1
...
should eventually (not necessarily right after the right-most 1, though) all give
a "single-scrape-failure" (even though it's more than just one failed scrape - it's
not a target-down), simply because there are 0s, but for a time span of less
than 1m.

- 1 0 1 0 0 0 0 0 0
should give both a single-scrape-failure alert (the left-most single 0) AND a
target-down alert (the 6 consecutive zeros)

-           1 0 1 0 1 0 0 0
should give at least two single-scrape-failure alerts, and for the right-most
zeros, it's not yet clear what they'll become.
-   0 0 0 0 0 0 0 0 0 0 0 0  (= 2x six zeros)
should give only 1 target-down alert
- 0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
should give 2 target-down alerts

Whether each of these alerts (e.g. in the 1 0 1 0 1 0 ... case) actually results
in a notification (mail) is of course a different matter and depends on the
Alertmanager configuration, but at least the alert should fire, and with the
right Alertmanager config one should actually get a notification for each single
failed scrape.
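
For context, the Alertmanager side I have in mind is roughly of this shape (the
receiver name, address and timings below are just made-up placeholders, not my
actual config), grouping per alertname and instance so that each firing alert can
become its own mail:

    route:
      receiver: mail
      group_by: ['alertname', 'instance']   # one notification group per alert and per target
      group_wait: 10s                       # don't sit on single-scrape failures for long
      group_interval: 1m
      repeat_interval: 4h

    receivers:
      - name: mail
        email_configs:
          - to: 'ops@example.org'           # placeholder; smarthost/from omitted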


Now, Brian has already given me some pretty good ideas how to do this; basically
the ideas were:
(assuming that 1m makes the target down, and a scrape interval of 10s)

For the target-down alert:
a) expr: 'up == 0'
   for:  1m
b) expr: 'max_over_time(up[1m]) == 0'
   for:  0s
=> here (b) was probably better, as it would use the same condition as is also
   used in the alert below, and there can be no weird timing effects depending
   on the for: and when these are actually evaluated.

For the single-scrape-failure alert:
A) expr: min_over_time(up[1m20s]) == 0 unless max_over_time(up[1m]) == 0
   for: 1m10s
   (numbers a bit modified from Brian's example, but I think the idea is the same)
B) expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
   for: 1m
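
For reference, written out as an actual rule group, (b) + (B) would look roughly
like this (the group name is just a placeholder of mine):

    groups:
      - name: scrape-availability
        rules:
          - alert: target-down
            expr: 'max_over_time(up[1m]) == 0'
            for:  0s
          - alert: single-scrape-failure
            expr: 'min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0'
            for:  1m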

=> I did test (B) quite a lot, but there was at least still one case where it
   failed, and that was when there were two consecutive but distinct target-down
   events, that is:
   0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
   which would eventually look like e.g.
   0 1 0 0 0 0 0 0   or   0 0 1 0 0 0 0 0
   in the above check, and thus trigger (via the left-most zeros) a false
   single-scrape-failure alert.

=> I'm not so sure whether I truly understand (A), especially with respect to
   edge cases when there's jitter or so (plus, IIRC, it also failed in the case
   described for (B)).


One approach I tried in the meantime was to use sum_over_time ... the idea was
simply to check how many ones there are for each case. But it turns out that
even if everything runs normally, the sum is not stable... sometimes, over [1m],
I got only 5, whereas most of the time it was 6.
Not really sure why that is, because the printed timestamps for each sample seem
to be super accurate (all the time), but the sum wasn't.
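
The kind of queries I was playing with were roughly (label matchers omitted; the
count_over_time line is just for comparison, not something I rely on):

    sum_over_time(up[1m])     # expected 6 with a 10s scrape interval, sometimes came out as 5
    count_over_time(up[1m])   # number of samples in the window, for comparison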


So I tried a different approach now, based on the above from Brian, which at
least in tests looks promising so far... but I'd like to hear what experts think
about it.

- both alerts have to be in the same alert group (I assume this assures they're
  then evaluated in the same thread and at the "same time", that is, with
  respect to the same reference timestamp)
- in my example I assume a scrape interval of 10s and an evaluation interval of
  7s (not really sure whether the latter matters, or whether it could be changed
  while the rules stay the same and it would still work)
- for: is always 0s ... I think that's good, because at least to me it's unclear
  how things are evaluated if the two alerts have different values for for:,
  especially in border cases.
- rules:
    - alert: target-down
      expr: 'max_over_time(up[1m0s]) == 0'
      for:  0s
    - alert: single-scrape-failure
      expr: 'min_over_time(up[15s] offset 1m) == 0
               unless max_over_time(up[1m0s]) == 0
               unless max_over_time(up[1m0s] offset 1m10s) == 0
               unless max_over_time(up[1m0s] offset 1m) == 0
               unless max_over_time(up[1m0s] offset 50s) == 0
               unless max_over_time(up[1m0s] offset 40s) == 0
               unless max_over_time(up[1m0s] offset 30s) == 0
               unless max_over_time(up[1m0s] offset 20s) == 0
               unless max_over_time(up[1m0s] offset 10s) == 0'
      for:  0s
I think the intended working of target-down is obvious, so let me explain the
ideas behind single-scrape-failure:

I divide the time span I look at into 10s buckets:
-130s  -120s  -110s  -100s  -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
  |      |      |      |      |      |      |  0   |      |      |      |      |      |      |   case 1
  |      |      |      |      |      |      |  0   |  0   |  0   |  0   |  0   |  0   |  0   |   case 2
  |      |      |      |      |      |      |  0   |  1   |  0   |  0   |  0   |  0   |  0   |   case 3
  |      |      |      |      |      |      |  0   |  1   |  1   |  1   |  1   |  1   |  1   |   case 4
  |      |      |      |      |      |  1   |  0   |  1   |  0   |  0   |  0   |  0   |  0   |   case 5
  |      |      |      |      |      |  1   |  0   |  1   |  1   |  1   |  1   |  1   |  1   |   case 6
  |  1   |  0   |  0   |  0   |  0   |  0   |  0   |  1   |  0   |  0   |  0   |  0   |  0   |   case 7
1: Having a 0 somewhere between -70s and -60s is mandatory for a single scrape
   failure.
   For every 0 more rightwards it's not yet clear which case it will end up as
   (well, actually it may already be clear, if there's a 1 even further right,
   but that's too complex to check and not really needed).
   For every 0 more leftwards (older than -70s) the alert, if any, would have
   already fired back when that 0 was between -70s and -60s.

   So I check this via:
       min_over_time(up[15s] offset 1m) == 0
   Not really sure about the 15s ... the idea is to account for jitter, i.e. if
   there was only one 0 and that came a bit early and was already before -70s.
   I guess the question here is, what happens if I do:
       min_over_time(up[10s] offset 1m)
   and there is NO sample between -70 and -60? Does it take the next older one?
   Or the next newer?

2: Should not be a single-scrape-failure, but a target-down failure.
   This I get via the:
      unless max_over_time(up[1m0s]) == 0

3, 4: are actually undefined, because I didn't fill in the older numbers, so
      maybe there was another 1m full of 0s leading up to the left-most one
      (which would then have been its own target-down alert)
5, 6: Here it's clear, the 0 between -70 and -60 must be a single-scrape-failure
      and should alert, which already happens if the rule were just:
      expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
7: This one fails if we had just:
      expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
   because the 0 between -70 and -60 is actually NOT a single-scrape failure,
   but part of a target-down alert.

This is where the:
                           unless max_over_time(up[1m0s] offset 1m10s) == 0
                           unless max_over_time(up[1m0s] offset 1m   ) == 0
                           unless max_over_time(up[1m0s] offset 50s  ) == 0
                           unless max_over_time(up[1m0s] offset 40s  ) == 0
                           unless max_over_time(up[1m0s] offset 30s  ) == 0
                           unless max_over_time(up[1m0s] offset 20s  ) == 0
                           unless max_over_time(up[1m0s] offset 10s  ) == 0
come into play.
The idea is that I make a number of excluding conditions, which are the same as
the expr for target-down, just shifted around the important interval from -70
to -60:
-130s  -120s  -110s  -100s  -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
  |      |      |      |      |      |      |  0   |  X   |  X   |  X   |  X   |  X   |  X   |   unless max_over_time(up[1m0s]             ) == 0
  |      |      |      |      |      |      | 0/X  |  X   |  X   |  X   |  X   |  X   |      |   unless max_over_time(up[1m0s] offset 10s  ) == 0
  |      |      |      |      |      |  X   | 0/X  |  X   |  X   |  X   |  X   |      |      |   unless max_over_time(up[1m0s] offset 20s  ) == 0
  |      |      |      |      |  X   |  X   | 0/X  |  X   |  X   |  X   |      |      |      |   unless max_over_time(up[1m0s] offset 30s  ) == 0
  |      |      |      |  X   |  X   |  X   | 0/X  |  X   |  X   |      |      |      |      |   unless max_over_time(up[1m0s] offset 40s  ) == 0
  |      |      |  X   |  X   |  X   |  X   | 0/X  |  X   |      |      |      |      |      |   unless max_over_time(up[1m0s] offset 50s  ) == 0
  |      |  X   |  X   |  X   |  X   |  X   | 0/X  |      |      |      |      |      |      |   unless max_over_time(up[1m0s] offset 1m   ) == 0
  |  X   |  X   |  X   |  X   |  X   |  X   |  0   |      |      |      |      |      |      |   unless max_over_time(up[1m0s] offset 1m10s) == 0

X simply denotes whether the 10s interval is part of the respective 1m interval.
0/X is simply when the important interval from -70 to -60 is also part of that,
which doesn't matter as it's 0 anyway and we use max_over_time.

So, *if* the important interval from -70 to -60 is 0, it looks at the shifted 1m
intervals, whether any of those was a target-down, and if so, causes the
single-scrape-failure not to fire.
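
In case it helps to make the cases concrete: something like the following
promtool rule unit test (run with `promtool test rules rules_test.yml`) should
express the "all 1s, then only 0s" case, where only target-down is supposed to
fire; the file names, job/instance labels and the eval_time are just examples of
mine:

    # rules_test.yml
    rule_files:
      - alerts.yml            # the file containing the two rules above

    evaluation_interval: 7s

    tests:
      - interval: 10s         # scrape interval, one value per 10s
        input_series:
          - series: 'up{job="node", instance="somehost:9100"}'
            values: '1 1 1 1 1 1 0 0 0 0 0 0 0'
        alert_rule_test:
          # after ~1m of zeros, target-down should be firing
          # (2m6s is a multiple of the 7s evaluation interval)
          - eval_time: 2m6s
            alertname: target-down
            exp_alerts:
              - exp_labels:
                  job: node
                  instance: somehost:9100
          # ... and single-scrape-failure should NOT fire for this series
          - eval_time: 2m6s
            alertname: single-scrape-failure
            exp_alerts: []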


Now there are still many open questions.

First, and perhaps more rhetorical:
Why is this so hard to do in Prometheus? I know Prometheus isn't Icinga/Nagios,
but there a failed probe would immediately cause the check to go into UNKNOWN
state.
For Prometheus, whose main purpose is the scraping of metrics, one should assume
that people at least have a simple way to get notified if these scrapes fail.


But more concrete questions:
1) Does the above solution sound reasonable?
2) What about my up[15s] offset 1m ... should it be only [10s]? Or something else?
   (btw: the 10+5s is obviously one scrape interval plus less than one scrape
   interval (I took half))
3) Should the more or less corresponding
     unless max_over_time(up[1m0s] offset 1m10s) == 0
   rather be
     unless max_over_time(up[1m5s] offset 1m10s) == 0
4) The question from above:
   > what happens if I do:
   >     min_over_time(up[10s] offset 1m)
   > and there is NO sample between -70 and -60? Does it take the next older
   > one? Or the next newer?
5) I split up the time spans in chunks of 10s, which is my scrape interval.
   Is that even reasonable? Or should they rather be split up into evaluation
   intervals?
6) How do the above alerts depend on the evaluation interval? I mean, will they
   still work as expected if I use e.g. the scrape interval (10s)? Or could this
   cause the two intervals to be overlaid in just the wrong manner? Same if I'd
   use any divisor of the scrape interval, like 5s, 2s or 1s?
   What if I'd use an evaluation interval *bigger* than the scrape interval?
7) In all my above 10s intervals:
   -130s  -120s  -110s  -100s  -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
     |      |      |      |      |      |      |      |      |      |      |      |      |      |
   The query is always inclusive on both ends, right?
   So if a sample were to lie e.g. exactly on -70s, it would count for both
   intervals, the one from -80 to -70 and the one from -70 to -60.

   I'm a bit unsure whether or not that matters for my alerts.
   Intuitively not, because my expressions all look at intervals (there is no
   for: Xs or so), and if the sample is right at the border, well, that simply
   means both intervals have that value.
   And if there's another sample in the same interval, the max_ and min_
   functions should just do the right thing (I... kinda guess ^^).
8) I also thought about what would happen if there are multiple samples in one
   interval, e.g.:
   -130s  -120s  -110s  -100s  -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
     |  1   |  1   |  1   |  1   |  1   |  1   | 0 1  |  0   |  0   |  0   |  0   |  0   |  0   |   case 8a
     |  1   |  1   |  1   |  1   |  1   |  1   | 1 0  |  0   |  0   |  0   |  0   |  0   |  0   |   case 8b

   8a, 8b: min_over_time for the -70s to -60s interval would be 0 in both cases,
           but in 8a, that would mean the single-scrape-failure is lost.

           No idea how one can solve this. I guess not at all. :-(
           Perhaps by using an evaluation interval that mostly prevents this,
           e.g. a 7s evaluation interval for a 10s scrape interval.

           Or could one solve this by using count_over_time or last_over_time?
           (see the rough sketch right below)
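
Just to make that last question concrete (untested, and I haven't thought through
whether either of these actually distinguishes 8a from 8b), I mean replacing the
first condition with something like:

    last_over_time(up[15s] offset 1m) == 0    # only the newest sample in the bucket counts
    count_over_time(up[15s] offset 1m) > 1    # or: detect that the bucket holds more than one sample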


*If* that approach of mine (largely based on Brian's ideas) would indeed work as
intended... there's still one problem left:

If one wants a longer period after which target-down fires (e.g. 5m rather than
1m), but still keeps the short scrape interval of 10s, one gets an awfully big
expression (which probably doesn't get faster to execute the longer it gets).

Any ideas how to make that better?
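
One direction I briefly wondered about (completely untested, and I have no idea
how the absolute-time alignment of subquery steps interacts with all of the edge
cases above) would be to collapse the chain of offsets into a subquery, e.g. for
the 1m case:

    min_over_time(up[15s] offset 1m) == 0
      unless min_over_time( max_over_time(up[1m0s])[1m10s:10s] ) == 0

which would then only need the durations changed (1m -> 5m, 1m10s -> 5m10s) for
a 5m target-down period, instead of growing by another ~24 unless clauses.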


Thanks,
Chris.
