[prometheus-users] Prometheus / Alertmanager "Recovery" period

Vitali Raikov Wed, 06 Jan 2021 04:50:18 -0800

Hi guys,

I have a small problem and would like some advice on how to proceed with my
setup.

We are using Prometheus together with Opsgenie for incident management.
We are not using any grouping on the Prometheus/Alertmanager side and just
let Opsgenie do its thing (like deduplication). Generally speaking, all works
fine.

There is one thing that I would like to improve.
Whenever a metric goes over its threshold, the alert is triggered almost
instantly, and that's fine with us. However, when the alert gets resolved, I
don't want it to be resolved right away. Instead, I would like a so-called
"recovery" period: only if the metric does not go over the threshold again
in the 5 minutes after the alert "factually got resolved" should the alert
actually be resolved.

In short, I want the alert to be fast to trigger (according to the interval
setting of the alert rule, which is working) and slow to recover: it should
take an extra ~5 minutes to check that the alert is not coming back before
resolving it.
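
To make the behaviour I am after concrete, here is a minimal Python sketch of
the semantics. This is not Prometheus or Alertmanager code, and the names
`RecoveryAlert`, `threshold` and `recovery_seconds` are made up purely for
illustration. It assumes samples arrive in time order. The alert fires as soon
as a sample is over the threshold and resolves only once the value has stayed
at or below the threshold for the whole recovery period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryAlert:
    """Hypothetical model of a fast-to-fire, slow-to-resolve alert."""

    threshold: float
    recovery_seconds: float = 300.0  # the ~5 minute "recovery" period
    firing: bool = False
    last_breach: Optional[float] = None  # time of the latest sample over the threshold

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample and return whether the alert is firing afterwards."""
        if value > self.threshold:
            # Over the threshold: fire immediately and restart the countdown.
            self.firing = True
            self.last_breach = timestamp
        elif self.firing and timestamp - self.last_breach >= self.recovery_seconds:
            # At or below the threshold for the full recovery period: resolve.
            self.firing = False
        return self.firing


alert = RecoveryAlert(threshold=90.0)
# (timestamp in seconds, metric value); the metric dips at 60s, then spikes again.
samples = [(0, 95.0), (60, 80.0), (120, 96.0), (180, 70.0), (420, 70.0)]
states = [alert.observe(t, v) for t, v in samples]
print(states)  # [True, True, True, True, False]
```

The dip at 60s does not resolve the alert, because the metric goes over the
threshold again at 120s. The alert resolves only at 420s, a full 300 seconds
after the last breach.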

The problem is that right now, during some incidents, we have a certain
metric that varies quite a bit: the alert might be resolved now, but in a
minute or two it will reoccur because the metric went over the threshold
again.

I have these grouping values set:

group_interval: 1s
group_wait: 1s
group_by: ['...']

Is there something you could advise to improve this situation?
I am kind of lost here, and I haven't found this functionality in either
Prometheus or Alertmanager.

