Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:
> You want to "capture" single scrape failures? Sure - it's already being
> captured. Make yourself a dashboard.

Well, as I've said before, the dashboard always has the problem that someone actually needs to look at it.

> But do you really want to be *alerted* on every individual one-time scrape
> failure? That goes against the whole philosophy of alerting
> <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>,
> where alerts should be "urgent, important, actionable, and real". A single
> scrape failure is none of those.

I guess in the end I'll see whether or not I'm annoyed by it. ;-)

> How often do you get hosts where: (1) occasional scrape failures occur; and
> (2) there are enough of them to make you investigate further, but not enough
> to trigger any alerts?

So far I've seen two kinds of nodes: those where I never get scrape errors, and those where they happen regularly - and probably need investigation.

Anyway... I think I might have found a solution, which - if some assumptions I've made are correct - I'm somewhat confident works, even in the strange cases.

The assumptions I've made are basically these four:
- Prometheus does that "faking" of sample times, and thus these are always on point, with exactly one scrape interval between each. This in turn should mean that if I have e.g. a scrape interval of 10s and query `up[20s]`, then regardless of when this is done, I get at least 2 samples, and in some rare cases (when the evaluation happens exactly on a scrape time) 3 samples. Never more, never less. Which for `up` I think should be true, as Prometheus itself generates it, and not the exporter that is scraped.
- The evaluation interval is sufficiently smaller than the scrape interval, so that it's guaranteed that none of the `up`-samples are missed.
- After some small time (e.g. 10s) it's guaranteed that all samples are in the TSDB and a query will return them (basically, to counter the observation I've made in https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg).
- Both alerts run in the same alert group, and that means (I hope) that each query in them is evaluated with respect to the very same time.

With that, my final solution would be:

- alert: general_target-down   # “TD” below
  expr: 'max_over_time(up[1m] offset 10s) == 0'
  for: 0s

- alert: general_target-down_single-scrapes   # “TDSS” below
  expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
  for: 0s

And that seems to actually work, at least in all practical cases (of course it's difficult to simulate the cases where the evaluation happens right at the time of a scrape).

For anyone who'd ever be interested in the details, and why I think that works in all cases, I've attached below the git logs where I describe the changes in my config git.

Thanks to everyone for helping me with that :-)

Best wishes,
Chris.


(needs a mono-spaced font to work out nicely)

TL/DR:
-------------------------------------------------
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer <cales...@gmail.com>
Date:   Mon Mar 25 02:01:57 2024 +0100

alerts: overhauled the `general_target-down_single-scrapes`-alert

This is a major overhaul of the `general_target-down_single-scrapes`-alert, which turned out to be quite an effort spanning several months.

Before this branch was merged, the `general_target-down_single-scrapes`-alert (from now on called “TDSS”) had various issues.
While the alert did stop firing when the `general_target-down`-alert (from now on called “TD”) started to do so, it would still also fire for failing scrapes that eventually turned out to be an actual TD. For example, the first few (< ≈7) `0`s would have caused TDSS to fire, which would seamlessly be replaced by a firing TD (unless any `1`s came in between).

Assumptions made below:
• The scraping interval is `10s`.
• If a (single) time series for the `up`-metric is given like `0 1 0 0 1`, the time goes from left (farther back in time) to right (less farther back in time).


I) Goals
********

There should be two alerts:

• TD
  Is for general use and similar to Icinga’s concept of a host being `UP` or `DOWN` (with the minor difference that an unreachable Prometheus target does not necessarily mean that a host is `DOWN` in that sense).
  It should fire after scraping has failed for some time, for example one minute (which is assumed from now on).

• TDSS
  Since Prometheus is all about monitoring metrics, it’s of interest whether the scraping fails, even if it’s only every now and then for very short amounts of time, because in those cases samples are lost.
  TD will notice any scraping failures that last longer than its time, but won’t notice any that last less. TDSS shall notice these, but only fire if they are not part of an already ongoing TD and neither will be part of one. The idea is that it is an alert for the monitoring itself.
  Whether each firing alert actually results in a notification being sent is of course a different matter and depends on the configuration of the `alertmanager` (the current route that matches the alert name `general_target-down_single-scrapes` in `alertmanager.yml` should cause every single firing alert to be sent). Nevertheless, TDSS should fire for even only a single `0` surrounded by `1`s.

Examples (below, the `:` is “now”):

1 1 1 1 1 1 1: neither alert fires

1 1 1 1 1 1 0
1 1 1 1 1 0 0
1 1 1 1 0 0 0
1 1 1 0 0 0 0
1 1 0 0 0 0 0: neither alert shall fire yet (it may become either a TD or a TDSS)

1 0 0 0 0 0 0: TD shall fire

1 1 1 1 1 0 1
1 1 1 1 0 0 1
1 1 1 0 0 0 1
1 1 0 0 0 0 1
1 0 0 0 0 0 1: TDSS shall fire, not necessarily immediately (that is: exactly with the most recent `1`) but at least eventually, and stop firing.

1 1 1 0 1 0 1
1 1 0 1 0 0 1
1 0 0 1 0 0 1: TDSS shall fire, stop firing, fire again and stop firing again.

1 0 1 0 0 0 0 0 0: TDSS shall fire, stop firing, then TD shall fire.

1 0 0 0 0 0 0 1 0 0 0 0 0 0: TD shall fire, stop firing, and fire again.


II) Prometheus’ Mode Of Operation
*********************************

Neither an alert’s `for:` (which is however not used here anyway) nor the queries are made in terms of numbers of samples but in time durations. There is no way to make a query like `metric<6 samples>`, which would then (assuming a scrape interval of 10s) be some time around 1 minute. Instead, a query like `metric[1m]` gives any samples from now until 1m ago.

Usually this will be 6 samples; in some cases it may be 7 samples (namely when the request is made exactly at the time of a sample); in principle it may be even only 5 samples (namely when there is jitter and the samples aren’t recorded exactly on time); and for most metrics it could be any other number down to 0 (namely if metrics couldn’t be scraped for some reason). `up` is however special and “generated” by Prometheus itself, and should always be there, even if the target couldn’t be scraped.
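As a side note, this sample-count behaviour can be illustrated with `promtool test rules`, which generates perfectly spaced samples just like described here. A minimal sketch (the labels, values and evaluation times are my own illustrative choices; note that this assumes the closed range boundaries of Prometheus 2.x – Prometheus 3.0 made range selectors left-open, which would turn the `7` below into a `6`):
```
# sample-counts.test.yml – run with: promtool test rules sample-counts.test.yml
rule_files: []            # no alerting rules needed for a pure expression test
evaluation_interval: 5s

tests:
  - interval: 10s         # the scrape interval assumed throughout
    input_series:
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 1 1 1 1 1'     # samples at 0s, 10s, …, 70s
    promql_expr_test:
      # evaluated *between* sample times: `[1m]` sees 6 samples
      - expr: count_over_time(up[1m])
        eval_time: 1m5s
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 6
      # evaluated *exactly at* a sample time: both boundary samples fall
      # into the window, giving 7
      - expr: count_over_time(up[1m])
        eval_time: 1m
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 7
```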
Moreover, Prometheus (at least within some tolerance) fakes (see [0]) the times of samples to be straight on time, so for example a query like `up[1m]` will result in times/samples like:

1711333608.175  "1"
1711333618.175  "1"
1711333628.175  "1"
1711333638.175  "1"
1711333648.175  "1"
1711333658.175  "1"

here, all exactly at `*.175`.

This means that, relative to some starting point in time, the samples are scraped like this:

+0s   +10s  +20s
├─────┼─────┼┈
ⓢ     ⓢ     ⓢ
╵     ╵     ╵

Above and below, the +0s, +10s and +20s are scraping and sample times. If Prometheus wouldn’t fake the times of samples ⓢ, this might instead look like:

+0s   +10s  +20s
├─────┼─────┼┈
ⓢ│     │ⓢ  ⓢ│
ⓢ│    ⓢ     │ⓢ
╵     ╵     ╵

This would then even further complicate what might happen if the “moving” behaviour of queries (as described below) is applied on top of that.

With all the above, a query like `up[20s]` may give the following:

-20s  -10s   0s
├─────┼─────┤
│    ⓢ│    ⓢ│
│   ⓢ │   ⓢ │
│  ⓢ  │  ⓢ  │
│ ⓢ   │ ⓢ   │
│ⓢ    │ⓢ    │
ⓢ     ⓢ     ⓢ
╵     ╵     ╵

Above, the -20s, -10s and 0s are **not** points in time at which scraping is performed – they rather mark the duration (which will later intentionally be a multiple of the scrape interval) which the query “looks back”, for visualisation separated into pieces of the length of the scrape interval. This will also be the case in later illustrations where -Ns is used.

As the query may happen at any time, while the samples ⓢ (as described above) happen exactly on time – that is, always exactly one scrape interval apart from each other – the samples “move” within the look-back window depending on when the query is made. If the query is made exactly “at” the time of a scraping, one will even get 3 samples (because they, as described above, happen exactly on time). A query like `up[20s] offset 50s` would work analogously, just shifted.

With respect to some fixed sample times, and queries made at subsequent times, this would look like the following:

     …00.314s  …10.314s  …20.314s
         ┊         ┊         ┊
   ┊  ┊ ⓢ┊   ┊  ┊ ⓢ┊   ┊  ┊ ⓢ┊
└──┊──┊──┊┴──┊──┊──┊┘  ┊  ┊  ┊    query 1, 2 samples
   └──┊──┊───┴──┊──┊───┘  ┊  ┊    query 2, 2 samples
      └──┊──────┴──┊──────┘  ┊    query 3, 2 samples
         └─────────┴─────────┘    query 4 (exactly at a sample time), 3 samples

It follows from all this that the examples in (I) above are actually only correct in the usual case, and a bit misleading as to how Prometheus – respectively its queries and thus alerts – works. It’s not 6 consecutive `0`s as in:

1 0 0 0 0 0 0

that cause TD to fire, but having only `0`s for a time duration of 1m (relative to the current evaluation time):

-1m                     0s
├───────────────────────┤
1│ 0  0  0  0  0  0     │
1 │ 0  0  0  0  0  0    │
1  │0  0  0  0  0  0    │
1   0  0  0  0  0  0   0
╵                       ╵


III) Failed Approaches
**********************

In order to fulfil the goals from (I), various approaches were tried with quite some effort. Each of them ultimately failed for some reason. Some of them are listed here for educational purposes, respectively as a caution about which alternatives may fail in subtle cases. These approaches were discussed at [1].

a) Using `min_over_time()` and `max_over_time()`.

Based on an idea from Brian Candler, an expression for the TDSS like:
```
min_over_time(up[1m10s]) == 0  unless  max_over_time(up[1m10s]) == 0
```
with a `for:`-value of `1m`, and an expression for the TD like:
```
up == 0
```
with a `for:`-value of `1m`, was tried.

The expression for the latter was later changed to:
```
max_over_time(up[1m]) == 0
```
with a `for:`-value of `0s`, in order to make sure that TD would fire exactly when the same term would silence the TDSS.
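For illustration, how the TDSS expression from (a) reacts to a lone `0` can be sketched as a `promtool` expression test (assuming a `10s` scrape interval; the series, labels and evaluation time are my own illustrative choices):
```
rule_files: []
evaluation_interval: 5s

tests:
  - interval: 10s
    input_series:
      # one failed scrape at t=30s, surrounded by successful ones
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 0 1 1 1'
    promql_expr_test:
      # min over the window is 0 (the lone failure), but max is 1, so the
      # `unless`-silencer matches nothing and the expression returns the
      # series with value 0
      - expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
        eval_time: 1m5s
        exp_samples:
          - labels: '{instance="testnode.example.org",job="node"}'
            value: 0
```
So the matching itself works as intended for a lone `0`; the problems of this approach lay, as described next, in the timing and the interplay with the `for:`-durations.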
This was tried with evaluation intervals of `10s` and `7s`.

The TDSS never fired with time durations of exactly `1m` (as used by the TD) – it needed to be longer. But that alone already seemed fragile because of the differing times between TDSS and TD.

Also, it generally failed when a TD was quickly (probably within ≈ `1m10s`) followed by `0`s, for example:

0 0 0 0 0 0 0 1 0: This would have first caused TDSS to become pending; after the 6th or 7th `0`, TD would have fired (while TDSS would have still been pending); after the `1`, TD would have stopped firing; and with the next `0`, TDSS would have wrongly fired.

Something similar would have happened with the `for: 1m`-based TD.

In [1] it was also suggested to use different time durations in the TDSS, for example an expression like:
```
min_over_time(up[1m20s]) == 0  unless  max_over_time(up[1m0s]) == 0
```
with a `for:`-value of `1m10s`. This however seemed to have the same issues as above, and to be even more fragile with respect to the overlapping time windows.

b) Using `min_over_time()` only on a critical time window, with shifted silencers.

The solution from (a) was extended to a TDSS with an expression like:
```
min_over_time(up[15s] offset 1m) == 0
  unless max_over_time(up[1m] offset 1m10s) == 0
  unless max_over_time(up[1m] offset 1m   ) == 0
  unless max_over_time(up[1m] offset 50s  ) == 0
  unless max_over_time(up[1m] offset 40s  ) == 0
  unless max_over_time(up[1m] offset 30s  ) == 0
  unless max_over_time(up[1m] offset 20s  ) == 0
  unless max_over_time(up[1m] offset 10s  ) == 0
  unless max_over_time(up[1m]             ) == 0
```
with a `for:`-value of `0s` and a TD like above. This was tried with an evaluation interval of `8s`.

Using `15s` instead of `10s` was just to account for jitter (which should however not happen anyway – see (II) above) and should otherwise not matter.

The idea was to look only at the time window from -(1m+15s) to -1m (at which it is always clear whether a series of `0`s becomes a TD or a TDSS – though it may also already be clear earlier) for a `0`, and to silence the alert if that `0` is actually part of a longer series that forms a TD.

There were a number of issues with this approach:

It was again fragile with the many overlapping time windows. Further investigation would have been necessary on whether the many shifted silencers may wrongly silence a true TDSS in certain time series patterns, or – less problematically – fail to silence a wrong TDSS.

Changing the expression to cover a TD time that is longer than `1m` (while the scrape interval stays short) would have led to very large (and more complex to evaluate) expressions.

It was originally believed that the main problem was a fundamental flaw in the usage of `min_over_time()` on the critical time window, when jitter would have happened, like in:

-80s  -70s  -60s            0s
├─────┼─────┼───────────────┤
│1    │0   0│   0   ⋯   0   │   case 1
│1    │0   1│   0   ⋯   0   │   case 2
│1    │1   0│   0   ⋯   0   │   case 3
│1    │1   1│   0   ⋯   0   │   case 4
╵     ╵     ╵               ╵

In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0`, and in case 4 it would have yielded `1` – all as intended. The silencer with `max_over_time(up[1m])` would have silenced the alert in all cases, which would however have been wrong in case 2, where the single `0` would have gone through unnoticed.
However, and as described in (II) above, jitter should be prevented by Prometheus faking the times of samples, and thus a query like `up[10s]` (and similarly with `15s`) should give two samples (assuming a scrape interval of `10s`) only if the evaluation happens exactly at a time where both samples are exactly at the boundaries, like in:

-80s  -70s  -60s            0s
├─────┼─────┼───────────────┤
│1    0     0   0   ⋯   0   │   case 1
│1    0     1   0   ⋯   0   │   case 2
│1    1     0   0   ⋯   0   │   case 3
│1    1     1   0   ⋯   0   │   case 4
╵     ╵     ╵               ╵

In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0`, and in case 4 it would have yielded `1` – all as intended. The silencer with `max_over_time(up[1m])` would have silenced the alert only in cases 1 and 3 – again, all as intended.

It might have been possible to get this approach working, but at the time it was thought that any jitter (which, however, apparently cannot happen) would break it, and even without this issue, the shifted silencers may have caused other problems.

c) Using `changes()` only on a critical time window, with shifted silencers.

Chris Siebenmann proposed to use `resets()`, which was however first considered not feasible, as for example a time series like `1 0` would have already caused at least a simple expression to make TDSS fire, while this might still turn out to be a TD.

Instead, the solution from (b) was modified to use `changes()` in an expression like:
```
changes(up[1m5s]) > 1
  unless max_over_time(up[1m] offset 1m ) == 0
  unless max_over_time(up[1m] offset 50s) == 0
  unless max_over_time(up[1m] offset 40s) == 0
  unless max_over_time(up[1m] offset 30s) == 0
  unless max_over_time(up[1m] offset 20s) == 0
```
with a `for:`-value of `0s` and a TD like above.

This had similar issues as approach (b):

It seemed again fragile because of the many overlapping time windows.

Cases were found where the silencers would wrongly silence a TDSS, which was then lost. This happened sometimes for a time series like 0 0 0 0 0 0 0 1 0 1 1 …, that is: 7 or 8 `0`s which cause a TD, a single `1`, followed by a single `0` (which should be a TDSS), followed by only `1`s. Sometimes (but not always) the single `0` was not detected as a TDSS. This was probably dependent on how the respective evaluation time of the alert was shifted compared to the sample times.

One property of this approach would have been that it fires earlier in some (but not all) cases.

d) Using `resets()` only on a critical time window, with a silencer.

Eventually, a probable solution was found by again looking primarily at only a critical time window, however with `resets()`, a single (non-shifted) silencer, and a handler for a case where that silencer would wrongly silence the TDSS.

The TD was used as above (that is: the version that uses the expression `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).

In the first version of this approach, the following expression for the TDSS was looked at:
```
(      resets(up[20s] offset 1m) >= 1
  unless max_over_time(up[1m]) == 0 )
or
(      resets(up[20s] offset 1m) >= 1
  and  changes(up[20s] offset 1m) >= 2
  and  sum_over_time(up[20s] offset 1m) >= 2 )
```
where the critical time window is from -80s to -60s (that is, exactly before the time window of the TD), but which failed at least in a case like:

     -80s  -70s  -60s            -10s   0s
┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
     │  1  │  1  │   0   ⋯   0   │  0  │  1       1st step
  1  │  1  │  0  │   0   ⋯   0   │  1  │          2nd step

in which first the TD would have fired and then the TDSS.
It might have been possible to solve that in several ways, for example by using `sum_over_time()` or by trying a shifted silencer.

Eventually it also turned out that – given how the scraping and alert rule evaluation work, and especially as there’s no time jitter with samples – the whole second term after the `or` was not needed.


IV) Testing Commands
********************

For the final solution described below (but also, in similar forms, for the previous approaches), the following commands (each executed in another terminal) or similar were used for testing:

• Printing the currently pending or firing alerts:
```
while :; do curl -g 'http://localhost:9090/api/v1/alerts' 2>/dev/null | jq '.data.alerts' | grep -E 'alertname|state'; date +%s.%N; printf '\n'; sleep 1; done
```

• Printing the most recent samples and their times:
```
while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m20s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n'; printf '%s\n' -------------------------------; sleep 1; done
```

• Causing `0`s:
```
iptables -A OUTPUT --destination testnode.example.org -p tcp -m tcp -j REJECT --reject-with tcp-reset
```

• Causing `1`s: for example by reloading the netfilter rules.


V) Final Solution
*****************

The final solution is based on that shown in (III.d), but uses an overlapping critical time window and a shortened silencer.

The TD was used as above (that is: the version that uses the expression `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).

The critical time window has its middle exactly at the older end of the time window of the TD, so that there’s one scrape interval’s length on both sides (and thus it goes from -70s to -50s), and the silencer reaches exactly to the right end of the critical time window, which gives the following expression:
```
resets(up[20s] offset 50s) >= 1  unless  max_over_time(up[50s]) == 0
```
with a `for:`-value of `0s`.

The problem described for (III.d) cannot occur, as illustrated below:

     -70s  -60s  -50s            -10s   0s
┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
     │  1  │  0  ╎   0   ⋯   0   │  0  │  1       1st step
  1  │  0  │  0  ╎   0   ⋯   0   │  1  │          2nd step

in which the 1st step is the “earliest” time series that can cause a TD; but even if this moves on and the next sample is a `1`, the TDSS won’t fire, as there is then no longer a reset in the critical time window (because, due to time jitter being impossible for samples, the leftmost `1` must have moved out as the rightmost `1` moved in). The same is the case if the evaluation happens just at a sample time, so that the critical time window has three samples.

In general, this second version of the approach works as follows:

The whole solution depends heavily on time jitter of samples being impossible.

It should however be possible to change the time duration of the TD and/or the scrape interval, as long as the time durations and offsets for the TDSS are adapted accordingly.

Also, the evaluation interval must be sufficiently less than the scrape interval (enough to account for any jitter in the evaluation intervals). In principle it should even work (though even that was not thoroughly checked) if it’s equal to it, but since there may be time jitter with the evaluation intervals, samples might be “jumped over”. If it’s greater than the scrape interval, samples are definitely being “jumped over”. In those cases, the check would fail to produce reliable results.
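For concreteness, the TD and TDSS (in their unshifted form from this section) might then sit in a rules file like this (a minimal sketch; the group name and the `interval: 5s` – satisfying “sufficiently less” for a `10s` scrape interval – are my own illustrative choices):
```
groups:
  - name: general-targets       # illustrative name
    interval: 5s                # evaluation interval, sufficiently less than
                                # the 10s scrape interval
    rules:
      - alert: general_target-down
        expr: 'max_over_time(up[1m]) == 0'
        for: 0s
      - alert: general_target-down_single-scrapes
        expr: 'resets(up[20s] offset 50s) >= 1 unless max_over_time(up[50s]) == 0'
        for: 0s
```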
Also, the TD and TDSS must be in the same alert group, to ensure that they’re evaluated relative to the same current time.

A time window with the duration of two scrape intervals is needed, as `resets()` may only ever give a value > 0 if there are at least two samples, which – as described in (II) above – is only assured when looking back that long (which may however also yield up to three samples). If the time window were only one scrape interval long, one would need to be very lucky to get two samples.

As in (III.d), the basic idea is to look whether a reset occurred in the critical time window – here, from -70s to -50s – and to silence the alert if the reset is actually or possibly the start of a TD. “Reset” means any decrease of the value between two consecutive samples, as counted by the `resets()`-function. While `resets()` is documented for use with counters only, it seems to work with gauges, too, and especially with `up` (whose samples may have only the values `0` or `1`) even in a reasonable way.

Since there is – as described in (II) above – no jitter with the sample times, the 20s critical time window always has either two samples or three. It consists of two halves, each the size of a scrape interval, and it’s especially also impossible that one half contains two samples and the other only one – both halves always contain the same number of samples (if there are three in total, the middle sample is “shared” by both halves).

If there are two samples (which then are not exactly on the boundaries), one gets the following types of cases (at the moment where TDSS might fire):

-70s  -60s  -50s             0s      r │ m₅₀ ┃ TDSS     m₆₀ │ TD
├─────┼─────┼────────────────┤     ───┼─────╂──────   ─────┼──────
│  0  │  0  ╎  0    ⋯     0  │       0 │  0  ┃ -         0  │ fires
│  0  │  0  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  0  │  1  ╎  0    ⋯     0  │       0 │  0  ┃ -         1  │ -       case 2
│  0  │  1  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  1  │  0  ╎  0    ⋯     0  │       1 │  0  ┃ -ₛ        0  │ fires
│  1  │  0  ╎ at least one 1 │       1 │  1  ┃ fires     1  │ -       case 1
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
│  1  │  1  ╎  0    ⋯     0  │       0 │  0  ┃ -         1  │ -
│  1  │  1  ╎ at least one 1 │       0 │  1  ┃ -         1  │ -
│  1  │  1  ╎  1    ⋯     1  │       0 │  1  ┃ -         1  │ -

with “r” being `resets(up[20s] offset 50s)`, “m₅₀” being `max_over_time(up[50s])` and “m₆₀” being `max_over_time(up[1m])`, as well as with `ₛ` meaning that the TDSS was silenced by `max_over_time(up[50s]) == 0`.

This gives the desired firing behaviour. Of course, “at least one 1” might contain time series like 1 0 1, and TDSS would still not fire – but it would later, as the time series moves through the critical time window. Similarly in case 1, where the TDSS does fire, it may do so again later, depending on the “at least one 1”. In cases 2, the alert (either a TD or a TDSS) for the consecutive `0`s on the left side would have already fired earlier.
If there are three samples (which then are exactly on the boundaries), one gets the following types of cases (at the moment where TDSS might fire):

-70s  -60s  -50s             0s      r │ m₅₀ ┃ TDSS     m₆₀ │ TD
├─────┼─────┼────────────────┤     ───┼─────╂──────   ─────┼──────
0     0     0  0     ⋯     0 0       0 │  0  ┃ -         0  │ fires
0     0     0  at least one 1  ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     0     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     1     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        1  │ -       case 3
0     1     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
0     1     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1, 3
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     0     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        0  │ fires
1     0     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2
1     0     0  0     ⋯     0 1       1 │  1  ┃ fires     1  │ -       case 2, 4
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     0     1  anything        ⏼     1 │  1  ┃ fires     1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     1     0  0     ⋯     0 0       1 │  0  ┃ -ₛ        1  │ -
1     1     0  at least one 1  ⏼     1 │  1  ┃ fires     1  │ -       case 2
├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤     ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈   ┈┈┈┈┈┼┈┈┈┈┈┈
1     1     1  anything        ⏼     0 │  1  ┃ -         1  │ -       case 1
1     1     1  1     ⋯     1 1       0 │  1  ┃ -         1  │ -       case 1

with the same legend as above, as well as with “⏼” indicating the position of the rightmost sample (which may be `0` or `1`) of “at least one 1” respectively “anything”.

This gives the desired firing behaviour. As above, the “at least one 1” respectively the “anything” of cases 1 may contain time series like 1 0 1, and TDSS would still not fire – but it would later, as the time series moves through the critical time window. Similarly in cases 2, where the TDSS does fire, it may do so again later, depending on the “at least one 1” respectively the “anything”. As above, in cases 3, the alert (either a TD or a TDSS) for the consecutive `0`s on the left side would have already fired earlier.

Case 4, while its time series has as many consecutive `0`s as would in other cases have caused a TD, is still not a TD, as that – per definition – requires only `0`s in the last `1m`.

For both two and three samples in the critical time window:

A firing TDSS stops doing so when the `1 0` that causes the reset in the critical time window “moves” out of it (it should not be possible for this to be caused by getting silenced). The time at which a TDSS fires might vary, depending on the time jitter of the evaluation intervals.

If the evaluation interval is sufficiently smaller than the scrape interval (to account for time jitter in the former), it should not be possible that samples are “jumped over”. One interesting example (which includes case 4 above) of this is the following:

┈┈┼─┼─┼─────────┼┈┈
 0 1 0 0   ⋯   0 0   ⏼    1st step
0 1 0 0   ⋯   0 0  ⏼      2nd step

Given the evaluation interval is sufficiently small, the leftmost `1` that causes the reset cannot have moved further “out” than in the 2nd step. But by that time, the next sample `⏼` is assured to have moved “in”, and determines whether this is a TD or a TDSS.

In the `resets(up[20s] offset 50s) >= 1`-term of the TDSS’ expression, `>= 1` rather than `== 1` (which in principle would work, too) was used merely for the conceptual purpose that the alert shall rather fire as a false positive than not fire.
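The intended behaviour can also be sketched as a `promtool test rules` unit test against these two rules (assuming they live in `alerts.yml`; the labels, file names and evaluation time are my own illustrative choices):
```
# tdss.test.yml – run with: promtool test rules tdss.test.yml
rule_files:
  - alerts.yml          # assumed to contain the TD and TDSS rules from above
evaluation_interval: 5s

tests:
  - interval: 10s       # the scrape interval assumed throughout
    input_series:
      # a single failed scrape at t=50s, surrounded by successful ones
      - series: 'up{instance="testnode.example.org",job="node"}'
        values: '1 1 1 1 1 0 1 1 1 1 1 1'
    alert_rule_test:
      # at t=105s the reset (1 at 40s → 0 at 50s) lies in the critical
      # time window [-70s,-50s], while max_over_time(up[50s]) sees only
      # `1`s – so TDSS fires …
      - eval_time: 1m45s
        alertname: general_target-down_single-scrapes
        exp_alerts:
          - exp_labels:
              instance: testnode.example.org
              job: node
      # … and TD does not
      - eval_time: 1m45s
        alertname: general_target-down
        exp_alerts: []
```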
VI) Different Time Durations
****************************

Without having looked at it in detail: if a Prometheus additionally uses larger scrape intervals (for some targets), the alerts should in principle still work, though for those targets TDSS might of course never fire, as nothing would be considered a TDSS. In any case, the time durations for the TDSS must be aligned to the smallest scrape interval in use (and the evaluation interval must be aligned to that, as described above).

Some examples for the TDSS expression with different time durations:

• TD time duration: `5m`, scrape interval: `10s`
```
resets(up[20s] offset 290s) >= 1  unless  max_over_time(up[290s]) == 0
```

• TD time duration: `1m`, scrape interval: `20s`
```
resets(up[40s] offset 40s) >= 1  unless  max_over_time(up[40s]) == 0
```

• TD time duration: `5m`, scrape interval: `20s`
```
resets(up[40s] offset 280s) >= 1  unless  max_over_time(up[280s]) == 0
```

(Of course, the TD expression would need to be aligned, too, which should however be straightforward.)

[0] “query for time series misses samples (that should be there), but not when offset is used”
    (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)
[1] “better way to get notified about (true) single scrape failures?”
    (https://groups.google.com/g/prometheus-users/c/BwJNsWi1LhI)

commit b4b5586614b1add0f9cc71629390b8bc223b8181
Author: Christoph Anton Mitterer <cales...@gmail.com>
Date:   Fri Mar 29 05:48:13 2024 +0100

alerts: shift `general_target-down`- and `general_target-down_single-scrapes`-alerts

It was noted (see [0]) that a query like `metric[1m]` made at a time t+ε may not (yet) include the sample at time t if ε is sufficiently small. This could lead to wrong results with the `general_target-down`- and `general_target-down_single-scrapes`-alerts.

For example, the TD could fail in cases like this:

├─┼─┼─────────┤
┊⏼┊0┊ 0 ⋯ 0 ˽┊
 ⏼ 0 0 0 ⋯ 0 ˽

with “⏼” being either a `0` or a `1`, and with “˽” being the not yet available sample. Here, the alert would fire, which would only be correct if ˽ is a `0`.

For example, the TDSS could fail in cases like this:

├─┼─┼─────────┤
┊1┊0┊ 0 ⋯ 0 ˽┊
 1 0 0 0 ⋯ 0 ˽

with the same legend as above. Here, the alert would not fire (because it would be silenced), which would only be correct if ˽ is a `0` – if it’s however a `1`, the TDSS would be missed (and a TD would fire instead).

This is solved by shifting everything by a sufficiently large offset into the past, with `10s` seeming to be enough for now. This must be done for both alerts (TD and TDSS) and all queries in their expressions. Those which already have an offset must of course be shifted further.

With the shifted expressions, the following may be used as a testing command:

• Printing the most recent samples and their times:
```
while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m30s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n; 9i \n'; printf '%s\n' -------------------------------; sleep 1; done
```

[0] “query for time series misses samples (that should be there), but not when offset is used”
    (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)
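Putting both commits together, the resulting rules (matching the ones quoted at the top of this mail) might look like this as a complete rule group (a sketch; the group name and the `interval: 5s` are my own illustrative choices):
```
groups:
  - name: general-targets       # illustrative name
    interval: 5s                # evaluation interval, sufficiently less than
                                # the 10s scrape interval
    rules:
      - alert: general_target-down                  # TD
        expr: 'max_over_time(up[1m] offset 10s) == 0'
        for: 0s
      - alert: general_target-down_single-scrapes   # TDSS
        expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
        for: 0s
```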