Personally I think you're looking at this wrong.

You want to "capture" single scrape failures?  Sure - it's already being 
captured.  Make yourself a dashboard.
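For example, a dashboard panel counting failed scrapes per target over the 
last day could be as simple as (a sketch; the 24h window is just an example):

    count_over_time(up[24h]) - sum_over_time(up[24h])

i.e. the number of samples in that window where up was 0.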

But do you really want to be *alerted* on every individual one-time scrape 
failure?  That goes against the whole philosophy of alerting 
<https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>, 
where alerts should be "urgent, important, actionable, and real".  A single 
scrape failure is none of those.

If you want to do further investigation when a host has more than N 
single-scrape failures in 24 hours, sure. But firstly, is that urgent 
enough to warrant an alert? If it is, then you've also said you *don't* want 
to be alerted on this when a more important alert has been sent for the same 
host in the same time period.  That's tricky to get right, which is what 
this whole thread is about. Like you say: alertmanager is probably not the 
right tool for that.

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not 
enough to trigger any alerts?

If it's "not often" then I wouldn't worry about it too much anyway (check a 
dashboard), but in any case you don't want to waste time trying to bend 
existing tooling to work in ways it wasn't intended for. That is: if you 
need suitable tooling, then write it.

It could be as simple as a script doing one query per day, using the same 
logic I just outlined above:
- identify hosts with scrape failures above a particular threshold over the 
last 24 hours
- identify hosts where one or more alerts have been generated over the last 
24 hours (there are metrics for this)
- subtract the second set from the first set
- if the remaining set is non-empty, then send a notification

You can do this in any language of your choice, or even a shell script with 
promtool/curl and jq.
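For example (a sketch only: the threshold of 5 failures and the 
on(instance) matching are assumptions to adapt to your own labels), the 
daily check could boil down to a single query:

    (count_over_time(up[24h]) - sum_over_time(up[24h])) > 5
      unless on (instance)
    count_over_time(ALERTS{alertstate="firing"}[24h]) > 0

Run that once a day, e.g. with "promtool query instant" or the HTTP API, 
and send a notification whenever the result set is non-empty.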

On Friday 22 March 2024 at 02:31:52 UTC Christoph Anton Mitterer wrote:

>
> I've been looking into possible alternatives, based on the ideas given 
> here.
>
> I) First one completely different approach might be:
> - alert: target-down
>   expr: 'max_over_time( up[1m0s] ) == 0'
>   for: 0s
>
> and one of:
>
> - alert: single-scrape-failure
>   expr: 'min_over_time( up[2m0s] ) == 0'
>   for: 1m
>
> or
>
> - alert: single-scrape-failure
>   expr: 'resets( up[2m0s] ) > 0'
>   for: 1m
>
> or perhaps even
>
> - alert: single-scrape-failure
>   expr: 'changes( up[2m0s] ) >= 2'
>   for: 1m
>
> (which would however behave a bit differently, I guess)
>
> plus an inhibit rule that silences single-scrape-failure when
> target-down fires.
> The for: 1m is needed, so that target-down has a chance to fire
> (and inhibit) before single-scrape-failure does.
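>
> A minimal sketch of such an inhibit rule in alertmanager.yml (assuming
> both alerts carry matching instance/job labels):
>
> inhibit_rules:
>   - source_matchers:
>       - alertname = "target-down"
>     target_matchers:
>       - alertname = "single-scrape-failure"
>     equal: ['instance', 'job']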
>
> I'm not really sure whether that works in all cases, though,
> especially since I look back much further (and the additional time
> span further back may undesirably trigger again).
>
>
> Using for: > 0 seems generally a bit fragile for my use-case (because I 
> want to capture even single scrape failures, but with for: > 0 I need 
> at least two evaluations to actually trigger, so my evaluation interval 
> must be small enough that it runs >= 2 times during the scrape interval).
>
> Also, I guess the scrape intervals and the evaluation intervals are not 
> synced, so with for: 0s, when I look back e.g. [1m] and assume a 
> certain number of samples in that range, there may actually be more or 
> fewer.
>
>
> If I forget about the above approach with inhibiting, then I need to 
> consider cases like:
> ----time---->
> - 0 1 0 0 0 0 0 0
> the first zero should be a single-scrape-failure, the last 6 however a
> target-down
> - 1 0 0 0 0 0 1 0 0 0 0 0 0
> same here, the first 5 zeros should be a single-scrape-failure, the last 6
> however a target-down
> - 1 0 0 0 0 0 0 1 0 0 0 0 0 0
> here however, both should be target-down
> - 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
> or
> 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
> here, 2x target-down, 1x single-scrape-failure
>
>
>
>
> II) Using the original {min,max}_over_time approach:
> - min_over_time(up[1m]) == 0
> tells me there was at least one missing scrape in the last 1m.
> but that alone would already be the case for the first zero:
> . . . . . 0
> so:
> - for: 1m
> was added (and the [1m] was enlarged)
> but this would still fire with
> 0 0 0 0 0 0 0
> which should however be a target-down
> so:
> - unless max_over_time(up[1m]) == 0
> was added to silence it then
> but that would still fail in e.g. the case when a previous
> target-down runs out:
> 0 0 0 0 0 0 -> target down
> the next is a 1
> 0 0 0 0 0 0 1 -> single-scrape-failure
> and some similar cases,
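>
> (Assembled, with the enlarged window guessed as [2m] analogous to (I),
> that attempt would be roughly:
>
> - alert: single-scrape-failure
>   expr: 'min_over_time(up[2m]) == 0 unless max_over_time(up[1m]) == 0'
>   for: 1m
>
> just to have it in one place.)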
>
> Plus the usage of for: >0s is - in my special case - IMO fragile.
>
>
>
> III) So in my previous mail I came up with the idea of using:
> - alert: target-down
>   expr: 'max_over_time( up[1m0s] ) == 0'
>   for: 0s
> - alert: single-scrape-failure
>   expr: 'min_over_time(up[15s] offset 1m) == 0
>            unless max_over_time(up[1m0s]) == 0
>            unless max_over_time(up[1m0s] offset 1m10s) == 0
>            unless max_over_time(up[1m0s] offset 1m) == 0
>            unless max_over_time(up[1m0s] offset 50s) == 0
>            unless max_over_time(up[1m0s] offset 40s) == 0
>            unless max_over_time(up[1m0s] offset 30s) == 0
>            unless max_over_time(up[1m0s] offset 20s) == 0
>            unless max_over_time(up[1m0s] offset 10s) == 0'
>   for: 0m
> The idea was that, when I don't use for: >0s, the first time
> window where one can really be sure (in all cases) whether
> it's a single-scrape-failure or a target-down is a 0 in -70s to
> -60s:
> -130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
>   |     |     |     |     |    |    | 0  |    |    |    |    |    |    |
>   |     |     |     |     |    |    |    |    |    |    | 1  | 0  | 1  |   case 1
>   |     |     |     |     |    |    | 0  | 0  | 0  | 0  | 0  | 0  | 0  |   case 2
>   |     |     |  1  | 0   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | 1  |   case 3
> In case 1 it would already be clear when the zero is between -20
> and -10.
> But if there's a sequence of zeros, it takes until the -70s to -60s
> window before it becomes clear.
>
> Now the zero in that time span could also be that of a target-down
> sequence of zeros like in case 3.
> For these cases, I had the shifted silencers that each looked over
> 1m.
>
> Looked good at first, though there were some open questions.
> At least one main problem remained, namely that it would fail in e.g.
> this case:
> -130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
>   |  1  |  1  |  1  |  1  |  1 |  1 | 0 1 |  0 |  0 |  0 |  0 |  0 |  0 |   case 8a
> The zero between -70s and -60s would be noticed, but still be
> silenced, because the 1 next to it would not be.
>
>
>
>
> Chris Siebenmann suggested to use resets() ... and keep_firing_for:,
> which Ben Kochie suggested, too.
>
> First I didn't quite understand how the latter would help me? Maybe I have 
> the wrong mindset for it, so could you guys please explain what your idea 
> was with keep_firing_for:?
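>
> (For reference, keep_firing_for: is a per-rule field that keeps an
> alert firing for some time after its expression stops matching,
> e.g. hypothetically:
>
> - alert: target-down
>   expr: 'up == 0'
>   for: 1m
>   keep_firing_for: 5m
>
> but I don't yet see how that would distinguish my two cases.)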
>
>
>
>
> IV) resets() sounded promising at first, but while I tried quite a few
> variations, I wasn't able to get anything working.
> First, something like
> resets(up[1m]) >= 1
> alone (with or without a for: >0s) would already fire in case of:
> ----time---->
> 1 0
> which could still become a target-down, but also in case of:
> 1 0 0 0 0 0 0
> which is a target-down.
> And I think even if I add some "unless ...", I'd still have the same
> problem as above in (II): a false positive alert when a true
> target-down sequence moves through.
> So just like in (III) I'd need those shifted silencers.
>
> resets(up[1m]) >= 2
> wouldn't work either e.g. in case of:
> 1 0 1 1 1 1 1 1
> there simply is no 2nd reset.
>
> I even tried a variant where the target-down must come first in the
> rules definition:
> - alert: target-down
>   expr: 'up == 0'
>   for: 1m    # for: is needed here, or I get no ALERTS
> - alert: single-scrape-failure
>   expr: 'resets(up[1m0s]) > 0
>            unless on (instance,job) ALERTS{alertname="target-down"}'
>   for: 0m
>
> and where I then used ALERTS trying to filter ... but no success.
>
> V) Instead of resets() I tried changes() (which, unlike resets(), is
> not only meant for counters):
> - alert: target-down
>   expr: 'max_over_time( up[1m0s] ) == 0'
>   for: 0s
> - alert: single-scrape-failure
>   expr: (one of the variants below)
>
> using just
> changes(up[1m]) >= 1
> does of course not work, as it could be an incoming target-down
> 1 0 0 0 0 0 0
> or an outgoing one:
> 0 0 0 0 0 0 1
>
> using
> changes(up[1m]) >= 2
> seems promising at first; if I have e.g.
> 1 1 1 1 0 1
> it's already clear that it's a single-scrape-failure...
> but it could be something like 0 0 0 0 0 0 1 1 0 0 0
> i.e. an outgoing target-down and something that may still become
> one.
>  
>
> using
> changes(up[1m5s]) >= 2
> unless max_over_time(up[1m0s] offset 1m) == 0
> unless max_over_time(up[1m0s] offset 50s) == 0
> unless max_over_time(up[1m0s] offset 40s) == 0
> unless max_over_time(up[1m0s] offset 30s) == 0
> unless max_over_time(up[1m0s] offset 20s) == 0
> this uses the above, and again filters with the shifted 1m time spans (no
> need to look at offset 0s or 10s).
>
> But that fails e.g. in the case of 
> 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1
> (i.e. a target-down followed by a single-scrape-failure followed by
> OK)
>
>
>
>
> VI) avg_over_time.
> I guess I might just not understand what you mean, but at least
> something like:
> expr: 'avg_over_time(up[1m10s]) < 1 and avg_over_time(up[1m10s]) > 0'
> for: 1m
> fails already in the simple case of
> 0 0 0 0 0 1
> where it gives a false alert after the target-down
>
>
> Well... guess I'm at my wits' end and this might simply not be possible 
> with PromQL.
>
> Cheers,
> Chris.
>
