Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:
> You want to "capture" single scrape failures? Sure - it's already being
> captured. Make yourself a dashboard.

Well, as I've said before, the dashboard always has the problem that someone actually needs to look at it.

> But do you really want to be *alerted* on every individual one-time scrape
> failure? That goes against the whole philosophy of alerting
> <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>,
> where alerts should be "urgent, important, actionable, and real". A single
> scrape failure is none of those.

I guess in the end I'll see whether or not I'm annoyed by it. ;-)

> How often do you get hosts where: (1) occasional scrape failures occur; and
> (2) there are enough of them to make you investigate further, but not enough
> to trigger any alerts?

So far I've seen two kinds of nodes: those where I never get scrape errors, and those where they happen regularly - and probably need investigation.

Anyway... I think I might have found a solution, which - if some assumptions I've made are correct - I'm somewhat confident works, even in the strange cases.

The assumptions I've made are basically these four:
- Prometheus does that "faking" of sample times, and thus these are always on point, with exactly one scrape interval between each. This in turn should mean that if I have e.g. a scrape interval of 10s and query `up[20s]`, then regardless of when this is done, I get at least 2 samples, and in some rare cases (when the evaluation happens exactly on a scrape time) 3 samples. Never more, never less. Which for `up` I think should be true, as Prometheus itself generates it, and not the exporter that is scraped.
- The evaluation interval is sufficiently smaller than the scrape interval, so that it's guaranteed that none of the `up`-samples are missed.
- After some small time (e.g. 10s) it's guaranteed that all samples are in the TSDB and a query will return them (basically, to counter the observation I've made in https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg).
- Both alerts run in the same alert group, and that means (I hope) that each query in them is evaluated with respect to the very same time.

With that, my final solution would be:

- alert: general_target-down   # “TD” below
  expr: 'max_over_time(up[1m] offset 10s) == 0'
  for: 0s

- alert: general_target-down_single-scrapes   # “TDSS” below
  expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
  for: 0s

And that seems to actually work, at least in all practical cases (of course it's difficult to simulate the cases where the evaluation happens right at the time of a scrape).

For anyone who'd ever be interested in the details, and why I think that works in all cases, I've attached below the git logs where I describe the changes in my config git.

Thanks to everyone for helping me with that :-)

Best wishes,
Chris.


(needs a mono-spaced font to work out nicely)

TL/DR:
-------------------------------------------------
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer <cales...@gmail.com>
Date:   Mon Mar 25 02:01:57 2024 +0100

alerts: overhauled the `general_target-down_single-scrapes`-alert

This is a major overhaul of the `general_target-down_single-scrapes`-alert, which turned out to be quite an effort spanning several months.

Before this branch was merged, the `general_target-down_single-scrapes`-alert (from now on called “TDSS”) had various issues.
While the alert did stop firing when the `general_target-down`-alert (from now on called “TD”) started to do so, it would still also fire for failing scrapes that eventually turned out to be an actual TD. For example, the first few (< ≈7) `0`s would have caused TDSS to fire, which would seamlessly be replaced by a firing TD (unless any `1`s came in between).

Assumptions made below:
• The scraping interval is `10s`.
• If a (single) time series for the `up`-metric is given like `0 1 0 0 1`, the time goes from left (farther back in time) to right (less farther back in time).


I) Goals
********

There should be two alerts:

• TD
  Is for general use and similar to Icinga’s concept of a host being `UP` or `DOWN` (with the minor difference that an unreachable Prometheus target does not necessarily mean that a host is `DOWN` in that sense).
  It should fire after scraping has failed for some time, for example one minute (which is assumed from now on).

• TDSS
  Since Prometheus is all about monitoring metrics, it’s of interest whether the scraping fails, even if it’s only every now and then for very short amounts of time, because in those cases samples are lost.
  TD will notice any scraping failures that last longer than its time, but won’t notice any that last less. TDSS shall notice these, but only fire if they are not part of an already ongoing TD and neither will be part of one. The idea is that it is an alert for the monitoring itself.
  Whether each firing alert actually results in a notification being sent is of course a different matter and depends on the configuration of the `alertmanager` (the current route that matches the alert name `general_target-down_single-scrapes` in `alertmanager.yml` should cause every single firing alert to be sent). Nevertheless, TDSS should fire for even only a single `0` surrounded by `1`s.

Examples (below, the `:` is “now”):

1 1 1 1 1 1 1: neither alert fires

1 1 1 1 1 1 0
1 1 1 1 1 0 0
1 1 1 1 0 0 0
1 1 1 0 0 0 0
1 1 0 0 0 0 0: neither alert shall fire yet (it may become either a TD or a TDSS)

1 0 0 0 0 0 0: TD shall fire

1 1 1 1 1 0 1
1 1 1 1 0 0 1
1 1 1 0 0 0 1
1 1 0 0 0 0 1
1 0 0 0 0 0 1: TDSS shall fire, not necessarily immediately (that is: exactly with the most recent `1`) but at least eventually, and stop firing.

1 1 1 0 1 0 1
1 1 0 1 0 0 1
1 0 0 1 0 0 1: TDSS shall fire, stop firing, fire again and stop firing again.

1 0 1 0 0 0 0 0 0: TDSS shall fire, stop firing, then TD shall fire.

1 0 0 0 0 0 0 1 0 0 0 0 0 0: TD shall fire, stop firing, and fire again.


II) Prometheus’ Mode Of Operation
*********************************

Neither an alert’s `for:` (which is however not used here anyway) nor the queries are made in terms of numbers of samples but in time durations. There is no way to make a query like `metric<6 samples>`, which would then (assuming a scrape interval of 10s) be some time around 1 minute. Instead, a query like `metric[1m]` gives any samples from now until 1m ago.

Usually this will be 6 samples; in some cases it may be 7 samples (namely when the request is made exactly at the time of a sample); in principle it may be even only 5 samples (namely when there is jitter and the samples aren’t recorded exactly on time); and for most metrics it could be any other number down to 0 (namely if metrics couldn’t be scraped for some reason). `up` is however special and “generated” by Prometheus itself, and should always be there, even if the target couldn’t be scraped.
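As a side note, this sample-count behaviour can be illustrated with `promtool test rules`, which generates perfectly spaced samples just like described here. A minimal sketch (the labels, values and evaluation times are my own illustrative choices; note that this assumes the closed range boundaries of Prometheus 2.x – Prometheus 3.0 made range selectors left-open, which would turn the `7` below into a `6`):
```
# sample-counts.test.yml – run with: promtool test rules sample-counts.test.yml
rule_files: []            # no alerting rules needed for a pure expression test
evaluation_interval: 5s

tests:
  - interval: 10s         # the scrape interval assumed throughout
    input_series:
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 1 1 1 1 1'     # samples at 0s, 10s, …, 70s
    promql_expr_test:
      # evaluated *between* sample times: `[1m]` sees 6 samples
      - expr: count_over_time(up[1m])
        eval_time: 1m5s
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 6
      # evaluated *exactly at* a sample time: both boundary samples fall
      # into the window, giving 7
      - expr: count_over_time(up[1m])
        eval_time: 1m
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 7
```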
Moreover, Prometheus (at least within some tolerance) fakes (see [0]) the times of samples to be straight on time, so for example a query like `up[1m]` will result in times/samples like:

1711333608.175  "1"
1711333618.175  "1"
1711333628.175  "1"
1711333638.175  "1"
1711333648.175  "1"
1711333658.175  "1"

here, all exactly at `*.175`.

This means that, relative to some starting point in time, the samples are scraped like this:

+0s   +10s  +20s
├─────┼─────┼┈
ⓢ     ⓢ     ⓢ
╵     ╵     ╵

Above and below, the +0s, +10s and +20s are scraping and sample times. If Prometheus wouldn’t fake the times of samples ⓢ, this might instead look like:

+0s   +10s  +20s
├─────┼─────┼┈
ⓢ│     │ⓢ  ⓢ│
ⓢ│    ⓢ     │ⓢ
╵     ╵     ╵

This would then even further complicate what might happen if the “moving” behaviour of queries (as described below) is applied on top of that.

With all the above, a query like `up[20s]` may give the following:

-20s  -10s   0s
├─────┼─────┤
│    ⓢ│    ⓢ│
│   ⓢ │   ⓢ │
│  ⓢ  │  ⓢ  │
│ ⓢ   │ ⓢ   │
│ⓢ    │ⓢ    │
ⓢ     ⓢ     ⓢ
╵     ╵     ╵

Above, the -20s, -10s and 0s are **not** points in time at which scraping is performed – they rather mark the duration (which will later intentionally be a multiple of the scrape interval) which the query “looks back”, for visualisation separated into pieces of the length of the scrape interval. This will also be the case in later illustrations where -Ns is used.

As the query may happen at any time, while the samples ⓢ (as described above) happen exactly on time – that is, always exactly one scrape interval apart from each other – the samples “move” within the look-back window depending on when the query is made. If the query is made exactly “at” the time of a scraping, one will even get 3 samples (because they, as described above, happen exactly on time). A query like `up[20s] offset 50s` would work analogously, just shifted.

With respect to some fixed sample times, and queries made at subsequent times, this would look like the following:

     …00.314s  …10.314s  …20.314s
         ┊         ┊         ┊
   ┊  ┊ ⓢ┊   ┊  ┊ ⓢ┊   ┊  ┊ ⓢ┊
└──┊──┊──┊┴──┊──┊──┊┘  ┊  ┊  ┊    query 1, 2 samples
   └──┊──┊───┴──┊──┊───┘  ┊  ┊    query 2, 2 samples
      └──┊──────┴──┊──────┘  ┊    query 3, 2 samples
         └─────────┴─────────┘    query 4 (exactly at a sample time), 3 samples

It follows from all this that the examples in (I) above are actually only correct in the usual case, and a bit misleading as to how Prometheus – respectively its queries and thus alerts – works. It’s not 6 consecutive `0`s as in:

1 0 0 0 0 0 0

that cause TD to fire, but having only `0`s for a time duration of 1m (relative to the current evaluation time):

-1m                     0s
├───────────────────────┤
1│ 0  0  0  0  0  0     │
1 │ 0  0  0  0  0  0    │
1  │0  0  0  0  0  0    │
1   0  0  0  0  0  0   0
╵                       ╵


III) Failed Approaches
**********************

In order to fulfil the goals from (I), various approaches were tried with quite some effort. Each of them ultimately failed for some reason. Some of them are listed here for educational purposes, respectively as a caution about which alternatives may fail in subtle cases. These approaches were discussed at [1].

a) Using `min_over_time()` and `max_over_time()`.

Based on an idea from Brian Candler, an expression for the TDSS like:
```
min_over_time(up[1m10s]) == 0  unless  max_over_time(up[1m10s]) == 0
```
with a `for:`-value of `1m`, and an expression for the TD like:
```
up == 0
```
with a `for:`-value of `1m`, was tried.

The expression for the latter was later changed to:
```
max_over_time(up[1m]) == 0
```
with a `for:`-value of `0s`, in order to make sure that TD would fire exactly when the same term would silence the TDSS.
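For illustration, how the TDSS expression from (a) reacts to a lone `0` can be sketched as a `promtool` expression test (assuming a `10s` scrape interval; the series, labels and evaluation time are my own illustrative choices):
```
rule_files: []
evaluation_interval: 5s

tests:
  - interval: 10s
    input_series:
      # one failed scrape at t=30s, surrounded by successful ones
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 0 1 1 1'
    promql_expr_test:
      # min over the window is 0 (the lone failure), but max is 1, so the
      # `unless`-silencer matches nothing and the expression returns the
      # series with value 0
      - expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
        eval_time: 1m5s
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 0
```
So the matching itself works as intended for a lone `0`; the problems of this approach lay, as described next, in the timing and the interplay with the `for:`-durations.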
This was tried with evaluation intervals of `10s` and `7s`.

The TDSS never fired with time durations of exactly `1m` (as used by the TD) – it needed to be longer. But that alone already seemed fragile because of the differing times between TDSS and TD.

Also, it generally failed when a TD was quickly (probably within ≈ `1m10s`) followed by `0`s, for example:

0 0 0 0 0 0 0 1 0: This would have first caused TDSS to become pending; after the 6th or 7th `0`, TD would have fired (while TDSS would have still been pending); after the `1`, TD would have stopped firing; and with the next `0`, TDSS would have wrongly fired.

Something similar would have happened with the `for: 1m`-based TD.

In [1] it was also suggested to use different time durations in the TDSS, for example an expression like:
```
min_over_time(up[1m20s]) == 0  unless  max_over_time(up[1m0s]) == 0
```
with a `for:`-value of `1m10s`. This however seemed to have the same issues as above, and to be even more fragile with respect to the overlapping time windows.

b) Using `min_over_time()` only on a critical time window, with shifted silencers.

The solution from (a) was extended to a TDSS with an expression like:
```
min_over_time(up[15s] offset 1m) == 0
  unless max_over_time(up[1m] offset 1m10s) == 0
  unless max_over_time(up[1m] offset 1m   ) == 0
  unless max_over_time(up[1m] offset 50s  ) == 0
  unless max_over_time(up[1m] offset 40s  ) == 0
  unless max_over_time(up[1m] offset 30s  ) == 0
  unless max_over_time(up[1m] offset 20s  ) == 0
  unless max_over_time(up[1m] offset 10s  ) == 0
  unless max_over_time(up[1m]             ) == 0
```
with a `for:`-value of `0s` and a TD like above. This was tried with an evaluation interval of `8s`.

Using `15s` instead of `10s` was just to account for jitter (which should however not happen anyway – see (II) above) and should otherwise not matter.

The idea was to look only at the time window from -(1m+15s) to -1m (at which it is always clear whether a series of `0`s becomes a TD or a TDSS – though it may also already be clear earlier) for a `0`, and to silence the alert if that `0` is actually part of a longer series that forms a TD.

There were a number of issues with this approach:

It was again fragile with the many overlapping time windows. Further investigation would have been necessary on whether the many shifted silencers may wrongly silence a true TDSS in certain time series patterns, or – less problematically – fail to silence a wrong TDSS.

Changing the expression to cover a TD time that is longer than `1m` (while the scrape interval stays short) would have led to very large (and more complex to evaluate) expressions.

It was originally believed that the main problem was a fundamental flaw in the usage of `min_over_time()` on the critical time window, when jitter would have happened, like in:

-80s  -70s  -60s            0s
├─────┼─────┼───────────────┤
│1    │0   0│   0   ⋯   0   │   case 1
│1    │0   1│   0   ⋯   0   │   case 2
│1    │1   0│   0   ⋯   0   │   case 3
│1    │1   1│   0   ⋯   0   │   case 4
╵     ╵     ╵               ╵

In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0`, and in case 4 it would have yielded `1` – all as intended. The silencer with `max_over_time(up[1m])` would have silenced the alert in all cases, which would however have been wrong in case 2, where the single `0` would have gone through unnoticed.
However, and as described in (II) above, jitter should be prevented by Prometheus faking the times of samples, and thus a query like `up[10s]` (and similarly with `15s`) should give two samples (assuming a scrape interval of `10s`) only if the evaluation happens exactly at a time where both samples are exactly at the boundaries, like in:

-80s  -70s  -60s            0s
├─────┼─────┼───────────────┤
│1    0     0   0   ⋯   0   │   case 1
│1    0     1   0   ⋯   0   │   case 2
│1    1     0   0   ⋯   0   │   case 3
│1    1     1   0   ⋯   0   │   case 4
╵     ╵     ╵               ╵

In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0`, and in case 4 it would have yielded `1` – all as intended. The silencer with `max_over_time(up[1m])` would have silenced the alert only in cases 1 and 3 – again, all as intended.

It might have been possible to get this approach working, but at the time it was thought that any jitter (which, however, apparently cannot happen) would break it, and even without this issue, the shifted silencers may have caused other problems.

c) Using `changes()` only on a critical time window, with shifted silencers.

Chris Siebenmann proposed to use `resets()`, which was however first considered not feasible, as for example a time series like `1 0` would have already caused at least a simple expression to make TDSS fire, while this might still turn out to be a TD.

Instead, the solution from (b) was modified to use `changes()` in an expression like:
```
changes(up[1m5s]) > 1
  unless max_over_time(up[1m] offset 1m ) == 0
  unless max_over_time(up[1m] offset 50s) == 0
  unless max_over_time(up[1m] offset 40s) == 0
  unless max_over_time(up[1m] offset 30s) == 0
  unless max_over_time(up[1m] offset 20s) == 0
```
with a `for:`-value of `0s` and a TD like above.

This had similar issues as approach (b):

It seemed again fragile because of the many overlapping time windows.

Cases were found where the silencers would wrongly silence a TDSS, which was then lost. This happened sometimes for a time series like 0 0 0 0 0 0 0 1 0 1 1 …, that is: 7 or 8 `0`s which cause a TD, a single `1`, followed by a single `0` (which should be a TDSS), followed by only `1`s. Sometimes (but not always) the single `0` was not detected as a TDSS. This was probably dependent on how the respective evaluation time of the alert was shifted compared to the sample times.

One property of this approach would have been that it fires earlier in some (but not all) cases.

d) Using `resets()` only on a critical time window, with a silencer.

Eventually, a probable solution was found by again looking primarily at only a critical time window, however with `resets()`, a single (non-shifted) silencer, and a handler for a case where that silencer would wrongly silence the TDSS.

The TD was used as above (that is: the version that uses the expression `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).

In the first version of this approach, the following expression for the TDSS was looked at:
```
(      resets(up[20s] offset 1m) >= 1
  unless max_over_time(up[1m]) == 0 )
or
(      resets(up[20s] offset 1m) >= 1
  and  changes(up[20s] offset 1m) >= 2
  and  sum_over_time(up[20s] offset 1m) >= 2 )
```
where the critical time window is from -80s to -60s (that is, exactly before the time window of the TD), but which failed at least in a case like:

     -80s  -70s  -60s            -10s   0s
┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
     │  1  │  1  │   0   ⋯   0   │  0  │  1       1st step
  1  │  1  │  0  │   0   ⋯   0   │  1  │          2nd step

in which first the TD would have fired and then the TDSS.
It might have been possible to solve that in several ways, for example by using `sum_over_time()` or by trying a shifted silencer.

Eventually it also turned out that – given how the scraping and alert rule evaluation work, and especially as there’s no time jitter with samples – the whole second term after the `or` was not needed.


IV) Testing Commands
********************

For the final solution described below (but also, in similar forms, for the previous approaches), the following commands (each executed in another terminal) or similar were used for testing:

• Printing the currently pending or firing alerts:
```
while :; do curl -g 'http://localhost:9090/api/v1/alerts' 2>/dev/null | jq '.data.alerts' | grep -E 'alertname|state'; date +%s.%N; printf '\n'; sleep 1; done
```

• Printing the most recent samples and their times:
```
while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m20s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n'; printf '%s\n' -------------------------------; sleep 1; done
```

• Causing `0`s:
```
iptables -A OUTPUT --destination testnode.example.org -p tcp -m tcp -j REJECT --reject-with tcp-reset
```

• Causing `1`s: for example by reloading the netfilter rules.


V) Final Solution
*****************

The final solution is based on that shown in (III.d), but uses an overlapping critical time window and a shortened silencer.

The TD was used as above (that is: the version that uses the expression `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).

The critical time window has its middle exactly at the older end of the time window of the TD, so that there’s one scrape interval’s length on both sides (and thus it goes from -70s to -50s), and the silencer reaches exactly to the right end of the critical time window, which gives the following expression:
```
resets(up[20s] offset 50s) >= 1  unless  max_over_time(up[50s]) == 0
```
with a `for:`-value of `0s`.

The problem described for (III.d) cannot occur, as illustrated below:

     -70s  -60s  -50s            -10s   0s
┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
     │  1  │  0  ╎   0   ⋯   0   │  0  │  1       1st step
  1  │  0  │  0  ╎   0   ⋯   0   │  1  │          2nd step

in which the 1st step is the “earliest” time series that can cause a TD; but even if this moves on and the next sample is a `1`, the TDSS won’t fire, as there is then no longer a reset in the critical time window (because, due to time jitter being impossible for samples, the leftmost `1` must have moved out as the rightmost `1` moved in). The same is the case if the evaluation happens just at a sample time, so that the critical time window has three samples.

In general, this second version of the approach works as follows:

The whole solution depends heavily on time jitter of samples being impossible.

It should however be possible to change the time duration of the TD and/or the scrape interval, as long as the time durations and offsets for the TDSS are adapted accordingly.

Also, the evaluation interval must be sufficiently less than the scrape interval (enough to account for any jitter in the evaluation intervals). In principle it should even work (though even that was not thoroughly checked) if it’s equal to it, but since there may be time jitter with the evaluation intervals, samples might be “jumped over”. If it’s greater than the scrape interval, samples are definitely being “jumped over”. In those cases, the check would fail to produce reliable results.
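For concreteness, the TD and TDSS (in their unshifted form from this section) might then sit in a rules file like this (a minimal sketch; the group name and the `interval: 5s` – satisfying “sufficiently less” for a `10s` scrape interval – are my own illustrative choices):
```
groups:
  - name: general-targets       # illustrative name
    interval: 5s                # evaluation interval, sufficiently less than
                                # the 10s scrape interval
    rules:
      - alert: general_target-down
        expr: 'max_over_time(up[1m]) == 0'
        for: 0s
      - alert: general_target-down_single-scrapes
        expr: 'resets(up[20s] offset 50s) >= 1 unless max_over_time(up[50s]) == 0'
        for: 0s
```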
Also, the TD and TDSS must be in the same alert group, to ensure that they’re evaluated relative to the same current time.

A time window with the duration of two scrape intervals is needed, as `resets()` may only ever give a value > 0 if there are at least two samples, which – as described in (II) above – is only assured when looking back that long (which may however also yield up to three samples). If the time window were only one scrape interval long, one would need to be very lucky to get two samples.

As in (III.d), the basic idea is to look whether a reset occurred in the critical time window – here, from -70s to -50s – and to silence the alert if the reset is actually or possibly the start of a TD. “Reset” means any decrease of the value between two consecutive samples, as counted by the `resets()`-function. While `resets()` is documented for use with counters only, it seems to work with gauges, too, and especially with `up` (whose samples may have only the values `0` or `1`) even in a reasonable way.

Since there is – as described in (II) above – no jitter with the sample times, the 20s critical time window always has either two samples or three. It consists of two halves, each the size of a scrape interval, and it’s especially also impossible that one half contains two samples and the other only one – both halves always contain the same number of samples (if there are three in total, the middle sample is “shared” by both halves).

If there are two samples (which then are not exactly on the boundaries), one gets the following types of cases (at the moment where TDSS might fire):

-70s  -60s  -50s             0s      r │ m₅₀ ┃ TDSS     m₆₀ │ TD
├─────┼─────┼────────────────┤     ───┼─────╂──────   ─────┼──────
│  0  │  0  ╎  0    ⋯     0  │       0 │  0  ┃ -         0  │ fires
│  0  │  0  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  0  │  1  ╎  0    ⋯     0  │       0 │  0  ┃ -         1  │ -       case 2
│  0  │  1  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  1  │  0  ╎  0    ⋯     0  │       1 │  0  ┃ -ₛ        0  │ fires
│  1  │  0  ╎ at least one 1 │       1 │  1  ┃ fires     1  │ -       case 1
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  1  │  1  ╎  0    ⋯     0  │       0 │  0  ┃ -         1  │ -
│  1  │  1  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -
│  1  │  1  ╎  1    ⋯     1  │       0 │  1  ┃ -         1  │ -

with “r” being `resets(up[20s] offset 50s)`, “m₅₀” being `max_over_time(up[50s])` and “m₆₀” being `max_over_time(up[1m])`, as well as with `ₛ` meaning that the TDSS was silenced by `max_over_time(up[50s]) == 0`.

This gives the desired firing behaviour. Of course, “at least one 1” might contain time series like 1 0 1, and TDSS would still not fire – but it would later, as the time series moves through the critical time window. Similarly in case 1, where the TDSS does fire, it may do so again later, depending on the “at least one 1”. In cases 2, the alert (either a TD or a TDSS) for the consecutive `0`s on the left side would have already fired earlier.
If there are three samples (which then are exactly on the boundaries), one gets the following types of cases (at the moment where TDSS might fire):

-70s  -60s  -50s             0s      r │ m₅₀ ┃ TDSS     m₆₀ │ TD
├─────┼─────┼────────────────┤     ───┼─────╂──────   ─────┼──────
0     0     0  0     ⋯     0 0       0 │  0  ┃ -         0  │ fires
0     0     0  at least one 1  ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     0     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     1     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        1  │ -       case 3
0     1     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     1     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     0     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        0  │ fires
1     0     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2
1     0     0  0     ⋯     0 1       1 │  1  ┃ fires     1  │ -       case 2, 4
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     0     1  anything        ⏼     1 │  1  ┃ fires     1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     1     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        1  │ -
1     1     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     1     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1
1     1     1  1     ⋯     1 1       0 │  1  ┃ -         1  │ -       case 1

with the same legend as above, as well as with “⏼” indicating the position of the rightmost sample (which may be `0` or `1`) of “at least one 1” respectively “anything”.

This gives the desired firing behaviour. As above, the “at least one 1” respectively the “anything” of cases 1 may contain time series like 1 0 1, and TDSS would still not fire – but it would later, as the time series moves through the critical time window. Similarly in cases 2, where the TDSS does fire, it may do so again later, depending on the “at least one 1” respectively the “anything”. As above, in cases 3, the alert (either a TD or a TDSS) for the consecutive `0`s on the left side would have already fired earlier.

Case 4, while its time series has as many consecutive `0`s as would in other cases have caused a TD, is still not a TD, as that – per definition – requires only `0`s in the last `1m`.

For both two and three samples in the critical time window:

A firing TDSS stops doing so when the `1 0` that causes the reset in the critical time window “moves” out of it (it should not be possible for this to be caused by getting silenced). The time at which a TDSS fires might vary, depending on the time jitter of the evaluation intervals.

If the evaluation interval is sufficiently smaller than the scrape interval (to account for time jitter in the former), it should not be possible that samples are “jumped over”. One interesting example (which includes case 4 above) of this is the following:

┈┈┼─┼─┼─────────┼┈┈
 0 1 0 0   ⋯   0 0   ⏼    1st step
0 1 0 0   ⋯   0 0  ⏼      2nd step

Given the evaluation interval is sufficiently small, the leftmost `1` that causes the reset cannot have moved further “out” than in the 2nd step. But by that time, the next sample `⏼` is assured to have moved “in”, and determines whether this is a TD or a TDSS.

In the `resets(up[20s] offset 50s) >= 1`-term of the TDSS’ expression, `>= 1` rather than `== 1` (which in principle would work, too) was used merely for the conceptual purpose that the alert shall rather fire as a false positive than not fire.
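The intended behaviour can also be sketched as a `promtool test rules` unit test against these two rules (assuming they live in `alerts.yml`; the labels, file names and evaluation time are my own illustrative choices):
```
# tdss.test.yml – run with: promtool test rules tdss.test.yml
rule_files:
  - alerts.yml          # assumed to contain the TD and TDSS rules from above
evaluation_interval: 5s

tests:
  - interval: 10s       # the scrape interval assumed throughout
    input_series:
      # a single failed scrape at t=50s, surrounded by successful ones
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 1 1 0 1 1 1 1 1 1'
    alert_rule_test:
      # at t=105s the reset (1 at 40s → 0 at 50s) lies in the critical
      # time window [-70s,-50s], while max_over_time(up[50s]) sees only
      # `1`s – so TDSS fires …
      - eval_time: 1m45s
        alertname: general_target-down_single-scrapes
        exp_alerts:
          - exp_labels:
              instance: testnode.example.org
              job: node
      # … and TD does not
      - eval_time: 1m45s
        alertname: general_target-down
        exp_alerts: []
```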
VI) Different Time Durations
****************************

Without having looked at it in detail: if a Prometheus additionally uses larger scrape intervals (for some targets), the alerts should in principle still work, though for those targets TDSS might of course never fire, as nothing would be considered a TDSS. In any case, the time durations for the TDSS must be aligned to the smallest scrape interval in use (and the evaluation interval must be aligned to that, as described above).

Some examples for the TDSS expression with different time durations:

• TD time duration: `5m`, scrape interval: `10s`
```
resets(up[20s] offset 290s) >= 1  unless  max_over_time(up[290s]) == 0
```

• TD time duration: `1m`, scrape interval: `20s`
```
resets(up[40s] offset 40s) >= 1  unless  max_over_time(up[40s]) == 0
```

• TD time duration: `5m`, scrape interval: `20s`
```
resets(up[40s] offset 280s) >= 1  unless  max_over_time(up[280s]) == 0
```

(Of course, the TD expression would need to be aligned, too, which should however be straightforward.)

[0] “query for time series misses samples (that should be there), but not when offset is used”
    (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)
[1] “better way to get notified about (true) single scrape failures?”
    (https://groups.google.com/g/prometheus-users/c/BwJNsWi1LhI)

commit b4b5586614b1add0f9cc71629390b8bc223b8181
Author: Christoph Anton Mitterer <cales...@gmail.com>
Date:   Fri Mar 29 05:48:13 2024 +0100

alerts: shift `general_target-down`- and `general_target-down_single-scrapes`-alerts

It was noted (see [0]) that a query like `metric[1m]` made at a time t+ε may not (yet) include the sample at time t if ε is sufficiently small. This could lead to wrong results with the `general_target-down`- and `general_target-down_single-scrapes`-alerts.

For example, the TD could fail in cases like this:

├─┼─┼─────────┤
┊⏼┊0┊ 0 ⋯ 0 ˽┊
 ⏼ 0 0 0 ⋯ 0 ˽

with “⏼” being either a `0` or a `1`, and with “˽” being the not yet available sample. Here, the alert would fire, which would only be correct if ˽ is a `0`.

For example, the TDSS could fail in cases like this:

├─┼─┼─────────┤
┊1┊0┊ 0 ⋯ 0 ˽┊
 1 0 0 0 ⋯ 0 ˽

with the same legend as above. Here, the alert would not fire (because it would be silenced), which would only be correct if ˽ is a `0` – if it’s however a `1`, the TDSS would be missed (and a TD would fire instead).

This is solved by shifting everything by a sufficiently large offset into the past, with `10s` seeming to be enough for now. This must be done for both alerts (TD and TDSS) and all queries in their expressions. Those which already have an offset must of course be shifted further.

With the shifted expressions, the following may be used as a testing command:

• Printing the most recent samples and their times:
```
while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m30s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n; 9i \n'; printf '%s\n' -------------------------------; sleep 1; done
```

[0] “query for time series misses samples (that should be there), but not when offset is used”
    (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)
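Putting both commits together, the resulting rules (matching the ones quoted at the top of this mail) might look like this as a complete rule group (a sketch; the group name and the `interval: 5s` are my own illustrative choices):
```
groups:
  - name: general-targets       # illustrative name
    interval: 5s                # evaluation interval, sufficiently less than
                                # the 10s scrape interval
    rules:
      - alert: general_target-down                  # TD
        expr: 'max_over_time(up[1m] offset 10s) == 0'
        for: 0s
      - alert: general_target-down_single-scrapes   # TDSS
        expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
        for: 0s
```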