On Saturday, 13 May 2023 at 03:26:18 UTC+1 Christoph Anton Mitterer wrote:

  (If there is jitter in the sampling time, then occasionally it might look 
at 4 or 6 samples)


Jitter in the sense that the samples are taken at slightly different times?


Yes. Each sample is timestamped with the time the scrape took place.

Consider a 5 minute window which generally contains 5 samples at 1 
minute intervals:

   |...*......*......*......*......*....|...*....

Now consider what happens when one of those samples is right on the 
boundary of the window:

   |*......*......*......*......*.......|*.......

Depending on the exact timings that the scrape takes place, it's possible 
that the first sample could fall outside:

   *|......*......*......*......*.......|*.......

Or the next sample could fall inside:

   |*......*......*......*......*......*|.......

 

Do you think that could affect the desired behaviour?


In my experience, the scraping regularity of Prometheus is very good (just 
try putting "up[5m]" into the PromQL browser and looking at the timestamps 
of the samples; they seem to increment at exact intervals).  So it's 
unlikely to happen much, though it might when the system is under high 
load, I guess.  Or it might never happen, if Prometheus records the 
timestamp of the time it *wanted* to make the scrape rather than when the 
scrape actually occurred.  Determining that would require looking at the 
source code.
 

Another point I basically don't understand... how does all that relate to 
the scrape intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 
5m as you've said in the other thread).

The series up[Ns] looks back N seconds, giving whichever samples fall 
between then and now. AFAIU, it doesn't go "automatically" back any further 
(like the 5m above), right?


That's correct.
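In other words (a summary sketch; the comments are mine):

   up       # instant selector: the most recent sample per series,
            # looking back up to 5m (Prometheus's default lookback delta)
   up[2m]   # range selector: only samples whose timestamps fall within
            # the last 2 minutes; no extra lookback is applied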

So if you're trying to make mutually exclusive expressions which fire in 
case A but not B, and in case B but not A, then you'd probably be better 
off writing them both to use up[5m].

min_over_time(up[5m]) == 0    # for the main alert, instead of "up == 0" with "for: 5m"
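
As a sketch of how such a mutually exclusive pair could look as alerting 
rules (the rule names and the exact case split are my own illustration, not 
a prescription):

   groups:
     - name: instance-down        # hypothetical group name
       rules:
         - alert: DownForFullWindow
           # every sample of "up" in the last 5m was 0
           expr: max_over_time(up[5m]) == 0
         - alert: DownWithinWindow
           # "up" dipped to 0 at some point in the last 5m,
           # but was 1 at some other point in the same window
           expr: min_over_time(up[5m]) == 0 and max_over_time(up[5m]) == 1

Since both expressions read the same up[5m] range, a given evaluation sees 
one consistent set of samples, so at most one of the two alerts can fire.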

 


In order for the for: to work I need at least two samples


No, you just need two rule evaluations. The rule evaluation interval 
doesn't have to be the same as the scrape interval, and even if they are 
the same, they are not synchronized.
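
Both intervals are set independently in prometheus.yml; a minimal sketch 
(values purely illustrative):

   global:
     scrape_interval: 1m         # how often targets are scraped
     evaluation_interval: 15s    # how often rule groups are evaluated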


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas


(*rule evaluation* cycles, if your rule evaluation interval is 1m)
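
To make the counting concrete, assuming a 1m evaluation interval and "up" 
staying at 0 throughout:

   evaluation:   t+0m     t+1m     t+2m     t+3m     t+4m     t+5m
   alert state:  pending  pending  pending  pending  pending  firing

The alert enters "pending" at the first evaluation where the expression is 
true, and only moves to "firing" once the condition has held for the full 
5m: six consecutive evaluations.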
 


As far as I understand you... 6 cycles of the rule evaluation interval... 
with at least two samples within that interval, right?


No.  The expression "up" is evaluated at each rule evaluation time, and it 
gives the most recent value of "up", looking back up to 5 minutes.

So if you had a scrape interval of 2 minutes, with a rule evaluation 
interval of 1 minute it could be that two rule evaluations of "up" see the 
same scraped value.

(This can also happen in real life with a 1 minute scrape interval, if you 
have a failed scrape)
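
As a picture, in the same style as before (S = scrape every 2m, E = rule 
evaluation every 1m):

   scrapes:      S...........S...........S...........
   evaluations:  E.....E.....E.....E.....E.....E.....

Each E reads the most recent S to its left (within the 5m lookback), so 
the two evaluations between consecutive scrapes return the same sample.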

 

Once an alert fires (in prometheus), even if just for one evaluation 
interval cycle... and there is no inhibition rule or the like in 
alertmanager... is it expected that a notification is sent out for sure... 
regardless of alertmanager's grouping settings?


There is group_wait. If the alert were to trigger and clear within the 
group_wait interval, I'd expect no notification to be sent. But I've not 
tested that.
 

Like when the alert fires for one short 15s evaluation interval and clears 
again afterwards... but group_wait: is set to some 7d... is it expected 
to send that single firing event after 7d, even if it has already resolved 
once the 7d are over and there was e.g. no further firing in between?


You'll need to test it, but my expectation would be that it wouldn't send 
*anything* for 7 days (while it waits for other similar alerts to appear), 
and if all alerts have disappeared within that period, that nothing would 
be sent.  However, I don't know if the 7 day clock resets as soon as all 
alerts go away, or it continues to tick.  If this matters to you, then test 
it.

Nobody in their right mind would use 7d for group_wait, of course.  
Typically you might set it to around a minute, so that if a bunch of 
similar alerts fire within that 1 minute period, they are gathered together 
into a single notification rather than a slew of separate notifications.
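
For reference, that kind of setup might look like this in alertmanager.yml 
(the receiver name and grouping labels are placeholders):

   route:
     receiver: default
     group_by: ['alertname', 'job']
     group_wait: 1m        # gather similar alerts for up to 1m before the first notification
     group_interval: 5m    # wait before notifying about alerts added to an existing group
     repeat_interval: 4h   # wait before re-sending a still-firing notification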

HTH,

Brian.
