Hi guys, I have a small problem and would like some advice on how to proceed with my setup.
We are using Prometheus together with Opsgenie for incident management. We are not using any grouping in Prometheus and just let Opsgenie do its thing (like deduplication). Generally speaking, it all works fine.

There is one thing I would like to improve. When a metric goes over its threshold, the alert is triggered almost instantly, and that's fine with us. However, when the metric drops back below the threshold, I don't want the alert to be resolved right away. Instead I would like a so-called "recovery" period: the alert should be resolved only if the metric does not go over the threshold again during the 5 minutes after it "factually got resolved".

In short, I want the alert to be fast to trigger (according to the interval setting of the alert rule, and that part is working) and slow to recover, taking an extra ~5 minutes to check that the alert is not coming back before resolving it.

The problem is that during some incidents we have a metric that varies quite a bit. The alert might be resolved now, but in a minute or two it will reoccur because the metric went over the threshold again.

I have these grouping values set:

    group_interval: 1s
    group_wait: 1s
    group_by: ['...']

Is there something you could advise to improve this situation? I am kind of lost here, and I haven't found this functionality in either Prometheus or Alertmanager.
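To make the behaviour I'm after concrete, here is a minimal Python sketch of the "fast to trigger, slow to recover" logic. It is only a model of the desired semantics, not anything Prometheus or Alertmanager provides: the class `SlowRecoveryAlert`, the helper `replay`, and the sample values are all hypothetical names and numbers made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class SlowRecoveryAlert:
    """Hypothetical model of the desired behaviour (not a Prometheus/Alertmanager API)."""

    threshold: float
    recovery_seconds: float = 300.0  # the ~5 minute recovery period
    firing: bool = False
    # Timestamp of the first below-threshold sample since the last breach.
    below_since: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample and return whether the alert is firing afterwards."""
        if value > self.threshold:
            # Fast to trigger: a single sample over the threshold fires the
            # alert and cancels any recovery period in progress.
            self.firing = True
            self.below_since = None
        elif self.firing:
            # Slow to recover: start (or continue) the recovery period and
            # resolve only once it has fully elapsed without a new breach.
            if self.below_since is None:
                self.below_since = timestamp
            if timestamp - self.below_since >= self.recovery_seconds:
                self.firing = False
                self.below_since = None
        return self.firing


def replay(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    recovery_seconds: float = 300.0,
) -> List[bool]:
    """Run (timestamp, value) samples through the model and collect the states."""
    alert = SlowRecoveryAlert(threshold=threshold, recovery_seconds=recovery_seconds)
    return [alert.observe(t, v) for t, v in samples]


if __name__ == "__main__":
    # One sample per minute; the metric dips at t=120 and breaches again at t=180.
    samples = [(0, 10), (60, 80), (120, 20), (180, 90), (240, 20),
               (300, 20), (360, 20), (420, 20), (480, 20), (540, 20)]
    for (t, v), state in zip(samples, replay(samples, threshold=50)):
        print(f"t={t:>3}s value={v:>2} firing={state}")
```

With samples like these, the short dip at t=120 would not resolve the alert, because the metric breaches again before the recovery period ends. The alert only resolves once the metric has stayed below the threshold for the full 5 minutes.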

