Ok @Stuart Clark <[email protected]>, thanks for reaching out.
Would you help me understand the queries below?

1) How is the endsAt time calculated? Is it calculated from resend_delay?
2) What is the default value of resend_delay, how can I check this configuration, and in which file is it defined?
3) Are the "msg: Received alerts" logs written when Prometheus sends alerts to Alertmanager? And when do the "msg: flushing" logs get written?

(mail thread below)

On Tue, Aug 31, 2021, 11:54 PM Stuart Clark <[email protected]> wrote:

> In general I don't think you can. You can use the absent function, but
> that will also alert if the scrape fails (e.g. the host is down). You
> could also use one of the over_time functions, but that wouldn't be based
> on the last sample before scrapes stopped.
>
> However, taking a step back, why do you want to keep alerting if scrapes
> fail? At that point you have no idea what is actually happening, so the
> server might be fine (but a firewall issue has stopped scrapes). You'd end
> up wasting time investigating a server when you don't need to.
>
> Instead you'd generally have two alerts - one for your threshold and one
> for scrape failures. You'd then investigate whichever fires.
>
> On 31 August 2021 19:15:20 BST, akshay sharma <[email protected]> wrote:
>>
>> Okay,
>>
>> The use case is: I have an alert which should be raised when the
>> threshold value mentioned is crossed. Say it is raised now; during the
>> next evaluation interval the exporter/target is down, so Prometheus will
>> get empty metrics and will mark the alert as resolved.
>>
>> I want the alert to stay in the firing state until the target comes back
>> or Prometheus gets the metrics back.
>>
>> How can I avoid this?
>>
>> On Tue, Aug 31, 2021, 7:49 PM Brian Candler <[email protected]> wrote:
>>
>>> Sorry, I don't understand what you're saying. It's true that
>>> min_over_time works over a range vector, and it returns an instant
>>> vector. What's the problem?
>>>
>>> An alert will not be marked as firing if there is no time series
>>> returned by the "expr" in the alerting rule. It is the presence or
>>> absence of a time series (not its value) which triggers an alert.
>>>
>>> Perhaps you can provide a concrete example of an alerting rule, and
>>> what the problem is (e.g. what behaviour you see that you don't want to
>>> see).
>>>
>>> On Tuesday, 31 August 2021 at 15:00:53 UTC+1 [email protected] wrote:
>>>
>>>> Thanks Brian,
>>>>
>>>> So, is min_over_time the only option to avoid that scenario (the alert
>>>> resolving when the metric goes missing)? Because min_over_time takes a
>>>> range vector, and it will keep the alert in the firing state (or its
>>>> old state) over that range.
>>>> But what if the service/exporter doesn't come back for, say, 1d, or
>>>> sometimes 1h? How could we possibly make it work?
>>>>
>>>> On Tue, Aug 31, 2021 at 1:39 PM Brian Candler <[email protected]> wrote:
>>>>
>>>>> I can't answer your questions about endsAt because I've never had any
>>>>> problem with this, so I've never had to dig into it. As long as your
>>>>> alert rule evaluation interval is something sensible like 1m or 15s
>>>>> then it should be fine. (This is defined at the level of the rule
>>>>> *group*, or by the global evaluation_interval if not specified there.)
>>>>>
>>>>> It sounds like the problem is that the alerting expression is
>>>>> resolving transiently - see also
>>>>> https://groups.google.com/g/prometheus-users/c/yVL8e257VvU
>>>>>
>>>>> To prove this, enter the alerting expression as-is into the PromQL
>>>>> browser:
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100) > x%
>>>>>
>>>>> Graph it over time and look for small gaps.
>>>>> And/or trim it to
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100)
>>>>>
>>>>> and look for values which are above and below x%, which could cause
>>>>> the alert to be resolved briefly before re-firing.
>>>>>
>>>>> There's no easy way around this except to make your alerting
>>>>> expressions more robust. For example, suppose you're alerting on
>>>>>
>>>>> expr: up == 0
>>>>>
>>>>> but the value of "up" goes like this when a machine becomes
>>>>> overloaded: 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 ...
>>>>>
>>>>> You're going to get a new alert every time it goes 0 1 0. So instead
>>>>> you might write it like this:
>>>>>
>>>>> expr: min_over_time(up[10m]) == 0
>>>>>
>>>>> This will alert as soon as it goes to 0, but will only stop alerting
>>>>> when it has been at 1 for 10 minutes continuously (or the entire
>>>>> metric is missing for 10 minutes continuously). This is
>>>>> straightforward, but unfortunately more sophisticated cases are
>>>>> tricky. Also, it's harder to cope with the case of occasional failures
>>>>> like 1 1 1 1 1 1 0 1 1 1 1 1 1 without sending spurious alerts. You
>>>>> could try building an expression with count_over_time instead, or by
>>>>> joining on the ALERTS metric which exists for alerts that are already
>>>>> firing, although joining on metrics which might not exist is not easy:
>>>>> https://www.robustperception.io/left-joins-in-promql
>>>>>
>>>>> Aside 1: the divide operator only generates results where the
>>>>> numerator and denominator have exactly the same set of labels, so your
>>>>> expression can be simplified to
>>>>>
>>>>> 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes * 100) > x%
>>>>>
>>>>> without any change in behaviour.
>>>>>
>>>>> Aside 2: if the idea is to warn when a disk is *soon going to be full*
>>>>> then you can do something more sophisticated than static thresholds, e.g.
>>>>>
>>>>> - name: DiskRate3h
>>>>>   interval: 10m
>>>>>   rules:
>>>>>   # Warn if rate of growth over last 3 hours means filesystem will
>>>>>   # fill in 2 days
>>>>>   - alert: DiskFilling
>>>>>     expr: |
>>>>>       predict_linear(node_filesystem_avail_bytes{instance!~"foo|bar|baz",fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
>>>>>     for: 6h
>>>>>     labels:
>>>>>       severity: warning
>>>>>     annotations:
>>>>>       summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
>>>>>
>>>>> This means that a disk which is sitting at 91% full but unchanging
>>>>> won't alert, but one which goes from 91% to 92% to 93% over 3 hours
>>>>> will alert. It's something to consider anyway, perhaps in conjunction
>>>>> with higher static thresholds so you get a quick alert when the
>>>>> filesystem reaches 98% (say).
>>>>>
>>>>> Remember that "alerts" in principle are things which may get someone
>>>>> out of bed, and should be something that can be immediately actioned
>>>>> and resolved (or at worst silenced while the problem is fixed soon).
>>>>> If you have alerts which are continuously firing, it leads to "alert
>>>>> fatigue" very quickly. There's an excellent document here (by an
>>>>> ex-Google site reliability engineer):
>>>>> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/01badbf8-e167-4a81-8629-8e19d916fd34n%40googlegroups.com
> Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAOrgXNKnuAbP7d7313C0KLL210BsjBv_ox5%2Bod_spAU_qscCVQ%40mail.gmail.com.
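[Editor's note: the two-alert pattern Stuart suggests (one rule for the threshold, one for scrape failures) might be sketched as a Prometheus rules file like the following. The group name, alert names, threshold, and durations are illustrative, not from the thread.]

```yaml
groups:
- name: example-two-alert-pattern
  rules:
  # Threshold alert: fires while the metric is present and over the limit.
  # If the target stops being scraped, this alert resolves - which is
  # fine, because the TargetDown alert below fires instead.
  - alert: HighDiskUsage
    expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem usage above 90%'
  # Scrape-failure alert: fires when the target stops answering, so a
  # missing metric is investigated in its own right rather than silently
  # resolving the threshold alert.
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Target is not being scraped'
```

Whichever of the two fires tells you which situation to investigate: a real threshold breach, or a target you can no longer see.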

