Hey Brian,

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:
> It depends on the exact semantics of "for". e.g. take a simple case of 1 minute rule evaluation interval. If you apply "for: 1m" then I guess that means the alert must be firing for two successive evaluations (otherwise, "for: 1m" would have no effect).

Seems you're right. I did quite some testing in the meantime, with the following Alertmanager route (note that I didn't use 5m but 1m, simply in order to not have to wait so long):

routes:
  - match_re:
      alertname: 'td.*'
    receiver: admins_monitoring
    group_by: [alertname]
    group_wait: 0s
    group_interval: 1s

and the following rules:

groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
      - alert: td-fast
        expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
        for: 1m
      - alert: td
        expr: 'up == 0'
        for: 1m

My understanding is (correct me if I'm wrong) that Prometheus basically runs one thread for the scrape job (which in my case has an interval of 15s) and another one that evaluates the alert rules (above, every 15s), which then sends the alerts to the Alertmanager (if firing). It felt a bit brittle to have the rules evaluated with the same period as the scrapes, so I did all tests once with 15s for the rule interval and once with 10s. But it seems this doesn't change the behaviour.

> But up[5m] only looks at samples wholly contained within a 5 minute window, and therefore will normally only look at 5 samples.

As you can see above, I had already noticed before that you were indeed right: if my for: is e.g. 4 * evaluation_interval (15s) = 1m, I need to look back 5 * evaluation_interval (15s) = 75s.

At least in my tests, that seemed to produce the desired behaviour, except for one case: when my "slow" td fires (i.e. after 5 consecutive "0"s) and then, within (less than?) 1m, there is another sequence of "0"s that eventually causes a "slow" td again. In that case, td-fast fires for a while, until it directly switches over to td firing.

Was your idea above,

> expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
> for: 7m

intended to fix that issue? Or could one perhaps somehow use

ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] == 1

to check whether td did fire, and then silence the false positive? (A rough, untested sketch of what I have in mind is in the PS below.)

> (If there is jitter in the sampling time, then occasionally it might look at 4 or 6 samples)

Jitter in the sense that the samples are taken at slightly different times? Do you think that could affect the desired behaviour? I would intuitively expect that it only causes the "base duration" to not be exactly 1m, so e.g. instead of taking 1m for the "slow" td to fire, it would happen +/- 15s earlier or later (and analogously for td-fast).

Another point I basically don't understand: how does all of that relate to the scrape intervals?

The plain up == 0 simply looks at the most recent sample (going back up to 5m, as you've said in the other thread). The range selector up[Ns] looks back N seconds, giving whichever samples lie between then and now. AFAIU it doesn't "automatically" go back any further than that (unlike the 5m above), right?

In order for the for: to work I need at least two samples, so doesn't that mean that as soon as any scrape interval is larger than for-time (1m) / 2 = ~30s (in the above example), the above two alerts will never fire, even if the target is down?

So if I had e.g. some jobs scraping only every 10m, I'd need another pair of td/td-fast alerts which filter on the job (up{job="longRunning"}) and either have only td (if that makes sense), or a td-fast for when one of the every-10m scrapes fails plus an even longer "slow" td, e.g. if that fails for 1h (roughly what I'd picture is sketched just below).
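A completely untested sketch of such a pair, just to make the idea concrete. The job name longRunning, the group name, the 25m/15m windows and the 1h are all made up for illustration, and the windows would need the same kind of tuning against the 10m scrape interval as the 75s above:

groups:
  - name: alerts_longRunning-scrapes
    interval: 1m
    rules:
      # at least one, but not every, scrape within the window failed
      - alert: td-fast-longRunning
        expr: 'min_over_time(up{job="longRunning"}[25m]) == 0 unless max_over_time(up{job="longRunning"}[25m]) == 0'
        for: 20m
      # every scrape failed, for about an hour; max_over_time instead of a
      # plain up == 0, since with a 10m scrape interval an instant selector
      # would often find no sample within the default 5m look-back
      - alert: td-longRunning
        expr: 'max_over_time(up{job="longRunning"}[15m]) == 0'
        for: 1h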
> If what I've written above is correct (and it may well not be!), then
>   expr: up == 0
>   for: 5m
> will fire if "up" is zero for 6 cycles, whereas

As far as I understand you: 6 cycles of the rule evaluation interval, with at least two samples within that interval, right?

> ... unless max_over_time(up[5m]) will suppress an alert if "up" is zero for (usually) 5 cycles.

Last but not least, an (only) partially related question: once an alert fires (in Prometheus), even if just for one evaluation-interval cycle, and there is no inhibition rule or the like in Alertmanager, is it expected that a notification is sent out for sure, regardless of Alertmanager's grouping settings? Like when the alert fires for one short 15s evaluation interval and clears again afterwards, but group_wait: is set to some 7d: is it expected that this single firing event is still sent after 7d, even if it has already resolved by the time the 7d are over and there was e.g. no further firing in between?

Thanks a lot :-)
Chris.
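PS: Regarding the ALERTS idea further up, what I had in mind is roughly the following (completely untested; the 2m look-back is just a guess, and the on(instance, job) matching is only there because ALERTS carries additional alertname/alertstate labels). I'm also not sure it wouldn't suppress a legitimate td-fast right after a td has resolved:

      - alert: td-fast
        # suppress td-fast while (or shortly after) td itself has been firing
        expr: >
          (min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0)
          unless on(instance, job)
          max_over_time(ALERTS{alertname="td", alertstate="firing"}[2m])
        for: 1m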