Ok @Stuart Clark <[email protected]>, thanks for reaching out.
Would you help me understand the queries below?

1) How is the endsAt time calculated? Is it calculated from resend_delay?
2) What is the default value of resend_delay, how can I check this configuration, and in which file is it defined?
3) Are the "msg: Received alerts" logs written when Prometheus sends alerts to Alertmanager? And when do the "msg: flushing" logs get written?

(mail thread below)

On Tue, Aug 31, 2021, 11:54 PM Stuart Clark <[email protected]> wrote:

> In general I don't think you can. You can use the absent function, but
> that will also alert if the scrape fails (e.g. the host is down). You
> could also use one of the over_time functions, but that wouldn't be based
> on the last sample before scrapes stopped.
>
> However, taking a step back, why do you want to keep alerting if scrapes
> fail? At that point you have no idea what is actually happening, so the
> server might be fine (but a firewall issue has stopped scrapes). You'd end
> up wasting time investigating a server when you don't need to.
>
> Instead you'd generally have two alerts - one for your threshold and one
> for scrape failures. You'd then investigate whichever fires.
>
> On 31 August 2021 19:15:20 BST, akshay sharma <[email protected]> wrote:
>>
>> Okay,
>>
>> The use case is: I have an alert which should be raised when the
>> threshold value mentioned is crossed. Say it is raised now; during the
>> next evaluation interval the exporter/target is down, so Prometheus will
>> get empty metrics and will mark the alert as resolved.
>>
>> I want the alert to stay in the firing state until the target comes back
>> or Prometheus gets the metrics back.
>>
>> How can I avoid this?
>>
>> On Tue, Aug 31, 2021, 7:49 PM Brian Candler <[email protected]> wrote:
>>
>>> Sorry, I don't understand what you're saying. It's true that
>>> min_over_time works over a range vector, and it returns an instant
>>> vector. What's the problem?
>>>
>>> An alert will not be marked as firing if there is no time series
>>> returned by the "expr" in the alerting rule. It is the presence or
>>> absence of a time series (not its value) which triggers an alert.
>>>
>>> Perhaps you can provide a concrete example of an alerting rule, and
>>> what the problem is (e.g. what behaviour you see that you don't want to
>>> see).
>>>
>>> On Tuesday, 31 August 2021 at 15:00:53 UTC+1 [email protected] wrote:
>>>
>>>> Thanks Brian,
>>>>
>>>> So, is min_over_time the only option to avoid that scenario (the alert
>>>> resolving when the metric goes missing)? Because min_over_time takes a
>>>> range vector, and it will keep the alert in the firing state (or its
>>>> old state) over that range.
>>>> But what if the service/exporter doesn't come back for, say, 1d, or
>>>> sometimes 1h? How could we possibly make it work?
>>>>
>>>> On Tue, Aug 31, 2021 at 1:39 PM Brian Candler <[email protected]> wrote:
>>>>
>>>>> I can't answer your questions about endsAt because I've never had any
>>>>> problem with this, so I've never had to dig into it. As long as your
>>>>> alert rule evaluation interval is something sensible like 1m or 15s
>>>>> then it should be fine. (This is defined at the level of the rule
>>>>> *group*, or by the global evaluation_interval if not specified there.)
>>>>>
>>>>> It sounds like the problem is that the alerting expression is
>>>>> resolving transiently - see also
>>>>> https://groups.google.com/g/prometheus-users/c/yVL8e257VvU
>>>>>
>>>>> To prove this, enter the alerting expression as-is into the PromQL
>>>>> browser:
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100) > x%
>>>>>
>>>>> Graph it over time and look for small gaps.
>>>>> And/or trim it to
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100)
>>>>>
>>>>> and look for values which are above and below x%, which could cause
>>>>> the alert to be resolved briefly before re-firing.
>>>>>
>>>>> There's no easy way around this except to make your alerting
>>>>> expressions more robust. For example, suppose you're alerting on
>>>>>
>>>>> expr: up == 0
>>>>>
>>>>> but the value of "up" goes like this when a machine becomes
>>>>> overloaded: 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 ...
>>>>>
>>>>> You're going to get a new alert every time it goes 0 1 0. So instead
>>>>> you might write it like this:
>>>>>
>>>>> expr: min_over_time(up[10m]) == 0
>>>>>
>>>>> This will alert as soon as it goes to 0, but will only stop alerting
>>>>> when it has been at 1 for 10 minutes continuously (or the entire
>>>>> metric is missing for 10 minutes continuously). This is
>>>>> straightforward, but unfortunately more sophisticated cases are
>>>>> tricky. Also, it's harder to cope with the case of occasional failures
>>>>> like 1 1 1 1 1 1 0 1 1 1 1 1 1 without sending spurious alerts. You
>>>>> could try building an expression with count_over_time instead, or by
>>>>> joining on the ALERTS metric which exists for alerts that are already
>>>>> firing, although joining on metrics which might not exist is not easy:
>>>>> https://www.robustperception.io/left-joins-in-promql
>>>>>
>>>>> Aside 1: the divide operator only generates results where the
>>>>> numerator and denominator have exactly the same set of labels, so your
>>>>> expression can be simplified to
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes * 100) > x%
>>>>>
>>>>> without any change in behaviour.
>>>>>
>>>>> Aside 2: if the idea is to warn when a disk is *soon going to be full*
>>>>> then you can do something more sophisticated than static thresholds, e.g.
>>>>>
>>>>> - name: DiskRate3h
>>>>>   interval: 10m
>>>>>   rules:
>>>>>   # Warn if rate of growth over last 3 hours means filesystem will
>>>>>   # fill in 2 days
>>>>>   - alert: DiskFilling
>>>>>     expr: |
>>>>>       predict_linear(node_filesystem_avail_bytes{instance!~"foo|bar|baz",fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
>>>>>     for: 6h
>>>>>     labels:
>>>>>       severity: warning
>>>>>     annotations:
>>>>>       summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
>>>>>
>>>>> This means that a disk which is sitting at 91% full but unchanging
>>>>> won't alert, but one which goes from 91% to 92% to 93% over 3 hours
>>>>> will alert. It's something to consider anyway, perhaps in conjunction
>>>>> with higher static thresholds so you get a quick alert when the
>>>>> filesystem reaches 98% (say).
>>>>>
>>>>> Remember that "alerts" in principle are things which may get someone
>>>>> out of bed, and should be something that can be immediately actioned
>>>>> and resolved (or at worst silenced while the problem is fixed soon).
>>>>> If you have alerts which are continuously firing, it leads to "alert
>>>>> fatigue" very quickly. There's an excellent document here (by an
>>>>> ex-Google site reliability engineer):
>>>>> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/01badbf8-e167-4a81-8629-8e19d916fd34n%40googlegroups.com
> Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAOrgXNKnuAbP7d7313C0KLL210BsjBv_ox5%2Bod_spAU_qscCVQ%40mail.gmail.com.
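[Editor's note: the two-alert pattern Stuart suggests (one rule for the threshold, one for scrape failures) might be sketched as a Prometheus rules file like the following. The group name, alert names, threshold, and durations are illustrative, not from the thread.]

```yaml
groups:
- name: example-two-alert-pattern
  rules:
  # Threshold alert: fires while the metric is present and over the limit.
  # If the target stops being scraped, this alert resolves - which is
  # fine, because the TargetDown alert below fires instead.
  - alert: HighDiskUsage
    expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem usage above 90%'
  # Scrape-failure alert: fires when the target stops answering, so a
  # missing metric is investigated in its own right rather than silently
  # resolving the threshold alert.
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Target is not being scraped'
```

Whichever of the two fires tells you which situation to investigate: a real threshold breach, or a target you can no longer see.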

