Thank you both for taking the time to answer my questions!

The main use case I've been thinking about is being able to differentiate between 
flapping alerts in the alert generator (Prometheus) and flapping alerts in 
the alert receiver (Alertmanager). In the former case, the alert is flapping 
because the data is alternating around the alert condition without stabilizing. 
In the latter case, the alert generator is failing to keep the alert 
receiver informed about the state of the alert before its expiration time 
(EndsAt) passes.
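
To make the receiver-side case concrete, here is a minimal sketch in Go of how I 
understand the EndsAt expiry to behave, using github.com/prometheus/common/model. 
The alert name and labels are made up for illustration, and this reflects my 
reading of the model package rather than a definitive description:

    package main

    import (
        "fmt"
        "time"

        "github.com/prometheus/common/model"
    )

    func main() {
        // An alert as sent by the generator: it has been firing for a while and
        // carries an EndsAt a few minutes in the future. The generator is
        // expected to re-send the alert before EndsAt to keep it firing at the
        // receiver.
        alert := &model.Alert{
            Labels:   model.LabelSet{"alertname": "HighErrorRate", "instance": "app-1"},
            StartsAt: time.Now().Add(-10 * time.Minute),
            EndsAt:   time.Now().Add(4 * time.Minute),
        }

        fmt.Println(alert.Resolved()) // false: EndsAt is still in the future

        // If the generator fails to refresh the alert in time, the receiver
        // eventually sees EndsAt in the past and treats the alert as resolved,
        // even though the underlying condition may still be true. When the next
        // update finally arrives, the alert appears to fire again - which looks
        // like flapping from the receiver's point of view.
        fmt.Println(alert.ResolvedAt(time.Now().Add(5 * time.Minute))) // true
    }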

In either case I'm not proposing that alerts become more responsive to 
flapping. However, based on what I've learned about Prometheus and 
Alertmanager so far, and the answers above, being able to differentiate 
between the two is not a goal of Prometheus, but rather the opposite: the 
goal is to make them look the same.

> For example, this is important for the Alertmanager to see alerts from 
multiple Prometheus servers as identical if they have the same label set, 
even if they began and were resolved at slightly different times.

Indeed! The other use case is whether we can make it easier to debug cases of 
flapping alerts, including when there are multiple Prometheus servers 
sending alerts to an Alertmanager. The motivation here is that I've been 
debugging a number of cases of flapping alerts and it can be hard to 
understand where the flapping is coming from.
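
As a reference point for that debugging, this is roughly how I understand the 
fingerprint to be computed in prometheus/common: an fnv64a hash over the sorted 
label names and values, with a separator byte after each one. The following is a 
hand-rolled sketch rather than the actual implementation, so details such as the 
separator byte (0xff, matching my reading of model.SeparatorByte) may be off:

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // fingerprint sketches the hashing scheme described above: sort the label
    // names, then feed name, separator, value, separator into an fnv64a hash
    // and take the final sum.
    func fingerprint(labels map[string]string) uint64 {
        names := make([]string, 0, len(labels))
        for name := range labels {
            names = append(names, name)
        }
        sort.Strings(names)

        h := fnv.New64a()
        for _, name := range names {
            h.Write([]byte(name))
            h.Write([]byte{0xff}) // separator
            h.Write([]byte(labels[name]))
            h.Write([]byte{0xff}) // separator
        }
        return h.Sum64()
    }

    func main() {
        // Only the label set matters: StartsAt (or any other timestamp) never
        // enters the hash, so an alert that resolves and fires again keeps the
        // same fingerprint.
        fmt.Printf("%016x\n", fingerprint(map[string]string{
            "alertname": "HighErrorRate",
            "instance":  "app-1",
        }))
    }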

> An "alert" in that sense is different from an "incident" or particular 
time-based instance of an alert, which Prometheus does not explicitly 
model. The closest thing to that is the Alertmanager taking in varying 
alert states over time and turning them into discrete notifications while 
applying throttling and grouping mechanisms. Those can prevent some 
flapping on the notification front, and careful alerting rules (averaging 
over large enough durations, using "for" durations, etc.) can do their part 
as well.

Thanks for the explanation here! I think this was the main design choice I 
wanted to understand.

But I have to ask: what is the purpose of Prometheus 
sending a StartsAt time to Alertmanager? This creates a time-based instance 
of an alert, because the alert now has a definitive StartsAt time, so 
Prometheus is in a sense both modelling time-based alerts and not modelling 
them at the same time.

I think the StartsAt time of an alert can also go backwards when running 
Prometheus in HA, because different Prometheus servers will have different 
offsets for the same evaluation group depending on when the Prometheus 
process first started.
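
For example (again a sketch using github.com/prometheus/common/model, with 
made-up label values), two HA replicas that evaluate the same rule group at 
slightly different offsets produce alerts with different StartsAt times but 
identical fingerprints, so Alertmanager treats them as the same alert and a 
later update can move the StartsAt it reports:

    package main

    import (
        "fmt"
        "time"

        "github.com/prometheus/common/model"
    )

    func main() {
        labels := model.LabelSet{"alertname": "HighErrorRate", "instance": "app-1"}
        now := time.Now()

        // Replica A happens to evaluate the group a few seconds before replica B
        // because the two processes started at different times, so the alerts
        // they generate carry different StartsAt values.
        fromReplicaA := &model.Alert{Labels: labels, StartsAt: now}
        fromReplicaB := &model.Alert{Labels: labels, StartsAt: now.Add(7 * time.Second)}

        // The fingerprint is derived from the labels alone, so as far as
        // Alertmanager is concerned these are the same alert.
        fmt.Println(fromReplicaA.Fingerprint() == fromReplicaB.Fingerprint()) // true
    }
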
On Tuesday, June 27, 2023 at 9:55:49 AM UTC+3 Julius Volz wrote:

> Yeah, everything in Prometheus and the Alertmanager revolves around alerts 
> only being identified by their label sets and nothing else. For example, 
> this is important for the Alertmanager to see alerts from multiple 
> Prometheus servers as identical if they have the same label set, even if 
> they began and were resolved at slightly different times.
>
> An "alert" in that sense is different from an "incident" or particular 
> time-based instance of an alert, which Prometheus does not explicitly 
> model. The closest thing to that is the Alertmanager taking in varying 
> alert states over time and turning them into discrete notifications while 
> applying throttling and grouping mechanisms. Those can prevent some 
> flapping on the notification front, and careful alerting rules (averaging 
> over large enough durations, using "for" durations, etc.) can do their part 
> as well.
>
> On Fri, Jun 23, 2023 at 9:34 AM Matthias Rampke <matt...@prometheus.io> 
> wrote:
>
>> For a very long time, Prometheus did not store alert state across 
>> restarts, so the alert startsAt would update even though the condition had 
>> not changed.
>>
>> I don't think we ever considered this time to be very meaningful or 
>> stable, partially due to the originally stateless implementation, but also 
>> due to the HA synchronization issue you mentioned.
>>
>> Can you explain more what the scenario is where the current label-based 
>> identity doesn't work? If I am reading it right, this is the first time 
>> someone asks for the alerts to be more responsive to flapping, more 
>> typically the desire is to reduce that, identifying successive alerts as 
>> being the same thing even if the alert condition wasn't held for a 
>> short period of time.
>>
>> /MR
>>
>> On Tue, 20 Jun 2023, 15:14 'George Robinson' via Prometheus Developers, <
>> prometheus...@googlegroups.com> wrote:
>>
>>> In prometheus/common the fingerprint of an alert is calculated as an 
>>> fnv64a hash of its labels. The labels are first sorted, and then for each 
>>> label the label name, a separator, the label value, and another separator 
>>> are added to the hash before the final sum is calculated.
>>>
>>> I noticed that something missing from the fingerprint is the alert's 
>>> StartsAt time. You could argue that an alert with labels a₁…aₙ that 
>>> started at time t₁ and then resolved at time t₂ is a different alert than 
>>> one with the same labels a₁…aₙ that started at time t₃ - and so these two 
>>> alerts should have different fingerprints.
>>>
>>> The fact that the fingerprint depends only on the labels has proven 
>>> interesting while debugging cases of flapping alerts in Alertmanager.
>>>
>>> However, while I would like to add StartsAt to the fingerprint, I am 
>>> concerned that doing so will break Prometheus rules when run in HA, as I 
>>> do not believe the StartsAt time is synchronised across rulers.
>>>
>>> I was wondering if there is some historical context for this? Perhaps 
>>> the reasons mentioned above, but there could be others that I am also 
>>> unaware of?
>>>
>>> Best regards
>>>
>>> George
>>>
>>>
>
>
> -- 
> Julius Volz
> PromLabs - promlabs.com
>
