Hey Brian

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:

It depends on the exact semantics of "for". e.g. take a simple case of 1 
minute rule evaluation interval. If you apply "for: 1m" then I guess that 
means the alert must be firing for two successive evaluations (otherwise, 
"for: 1m" would have no effect).


Seems you're right.

I did quite a bit of testing in the meantime with the following Alertmanager 
route (note that I didn't use 5m but 1m... simply so that I didn't have to 
wait so long):
  routes:
  - match_re:
      alertname: 'td.*'
    receiver:       admins_monitoring
    group_by:       [alertname]
    group_wait:     0s
    group_interval: 1s

and the following rules:
groups:
  - name:     alerts_general_single-scrapes
    interval: 15s    
    rules:
    - alert: td-fast
      expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
      for:  1m
    - alert: td
      expr: 'up == 0'
      for:  1m

My understanding is, correct me if I'm wrong, that Prometheus basically runs 
one loop for the scrape job (which in my case has an interval of 15s) and 
another one that evaluates the alert rules (above, every 15s) and then sends 
the alert to the Alertmanager (if it's firing).
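For reference, my mental model is roughly this prometheus.yml (the rule file 
name and the exporter port are just placeholders I made up):

global:
  scrape_interval: 15s            # the loop that scrapes the targets

rule_files:
  - alerts_general.yml            # contains the rule group with interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['lcg-lrz-ext.grid.lrz.de:9100']

with the rule evaluation interval coming from the per-group "interval: 15s" 
above, and the alerts being pushed to the Alertmanager whenever they fire.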

It felt a bit brittle to have the rules evaluated with the same period as 
the scrapes, so I ran all tests once with 15s for the rules interval and 
once with 10s. But this didn't seem to change the behaviour.


But up[5m] only looks at samples wholly contained within a 5 minute window, 
and therefore will normally only look at 5 samples.


As you can see above... I had already noticed before that you were indeed 
right, and if my for: is e.g. 4 * evaluation_interval (15s) = 1m ... I need 
to look back 5 * evaluation_interval (15s) = 75s.

At least in my tests, that seemed to give the desired behaviour, except for 
one case:
When my "slow" td fires (i.e. after 5 consecutive "0"s) and then, within 
(less than?) 1m, there is another sequence of "0"s that eventually causes a 
"slow" td. In that case, td-fast fires for a while, until it switches 
directly over to td firing.

Was your idea above with something like:
>    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
>    for: 7m
intended to fix that issue?
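(Written out as a full rule entry, I guess that would be something like the 
following - untested, just keeping the structure of my rules from above:
    - alert: td-fast
      expr: 'min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0'
      for:  7m
... only with the longer windows and for: duration.)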

Or could one perhaps use 
ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] 
== 1 somehow, to check whether it did fire... and then silence the false 
positive?
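Something like this is what I have in mind - completely untested, here with 
the instant vector instead of a range, and I'm not at all sure the label 
matching is right:
    - alert: td-fast
      expr: >
        (min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0)
        unless on(instance, job)
        ALERTS{alertname="td", alertstate="firing"} == 1
      for:  1m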

 

  (If there is jitter in the sampling time, then occasionally it might look 
at 4 or 6 samples)


Jitter in the sense that the samples are taken at slightly different times?
Do you think that could affect the desired behaviour? I would intuitively 
expect that it only causes the "base duration" to not be exactly e.g. 1m ... 
so e.g. instead of taking exactly 1m for the "slow" td to fire, it would 
happen some +/- 15s earlier or later (and likewise for td-fast).


Another point I basically don't understand... how does all that relate to 
the scrape intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 
5m, as you've said in the other thread).

The range up[Ns] looks back N seconds, giving whichever samples fall between 
then and now. AFAIU, it doesn't "automatically" go back any further than 
that (unlike the 5m above), right?
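(I.e. with my 15s scrape interval I'd expect, per series, roughly:
    up[75s]  ->  the (usually 5) samples from the last 75 seconds
    up       ->  only the single most recent sample, looked up to 5m back
if I got that right.)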

In order for the for: to work I need at least two samples... so doesn't that 
mean that as soon as any scrape interval is above for:-time (1m) / 2 = ~30s 
(in the above example), the above two alerts will never fire, even if the 
target is down?

So if I had e.g. some jobs scraping only every 10m ... I'd need another pair 
of td/td-fast alerts which filter on the job (up{job="longRunning"}), and 
either have only a td (if that makes sense) ... or a td-fast that fires if 
one of the every-10m scrapes fails, plus an even longer "slow" td that fires 
if, say, that keeps failing for 1h.


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas


As far as I understand you... 6 cycles of the rule evaluation interval... 
with at least two samples within that interval, right?
 

... unless max_over_time(up[5m])

will suppress an alert if "up" is zero for (usually) 5 cycles.



Last but not least, an (only) partially related question:

Once an alert fires (in Prometheus), even if just for one evaluation 
interval cycle... and there is no inhibition rule or the like in 
Alertmanager... is it expected that a notification is sent out for sure, 
regardless of Alertmanager's grouping settings?
Like when the alert fires for one short 15s evaluation interval and clears 
again afterwards... but group_wait: is set to some 7d ... is it expected to 
send that single firing event after 7d, even if it has already resolved once 
the 7d are over and there was e.g. no further firing in between?
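(To make it concrete, I mean my route from above, just with something like:
    group_wait:     7d
    group_interval: 1s
where the 7d is of course only there to exaggerate the effect.)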


Thanks a lot :-)
Chris.
