Hi guys,

I have a small problem and would like some advice on how to proceed with 
my setup.

We are using Prometheus together with Opsgenie for incident management. 
We are not using any grouping in Prometheus and just let Opsgenie do its 
thing (like deduplication). Generally speaking, it all works fine.

There is one thing that I would like to improve.
Whenever an alert is triggered because some metric went over the 
threshold, it fires almost instantly, and that's fine with us. However, 
when the alert clears, I don't want to resolve it right away. Instead, I 
would like a so-called "recovery" period: only if the metric does not go 
over the threshold again in the 5 minutes after it "factually got 
resolved" should the alert actually be resolved.

In short, I want the alert to be fast to trigger (according to the 
interval setting of the alert rule, which is already working) and slow to 
recover: it should take an extra ~5 minutes to check that the alert is not 
coming back before resolving it.
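
To make the behavior I'm after concrete, here is a minimal sketch in 
Python. It is only an illustration of the logic, not something that 
Prometheus or Alertmanager provides in this form; the class name 
`RecoveryGate`, the threshold of 80 and the sample values are all made up 
for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryGate:
    """Hypothetical helper: fires at once, resolves only after a quiet period."""

    threshold: float
    recovery_seconds: float = 300.0  # the ~5 minute recovery period
    firing: bool = False
    below_since: Optional[float] = None  # when the metric last dropped below

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True while the alert should be firing."""
        if value > self.threshold:
            # Fast to trigger: any breach fires and restarts the recovery clock.
            self.firing = True
            self.below_since = None
        elif self.firing:
            if self.below_since is None:
                self.below_since = timestamp
            # Slow to recover: resolve only after a full quiet period.
            if timestamp - self.below_since >= self.recovery_seconds:
                self.firing = False
                self.below_since = None
        return self.firing


# (timestamp in seconds, metric value); the metric flaps at t=120.
samples = [(0, 90), (60, 70), (120, 85), (180, 70), (420, 70), (480, 70)]
gate = RecoveryGate(threshold=80.0)
states = [gate.observe(t, v) for t, v in samples]
```

With these samples the alert would stay firing through the flap at t=120 
and only resolve at t=480, once the metric has been below the threshold 
for a full 300 seconds.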

The problem is that right now, during some incidents, we have a certain 
metric that varies quite a bit. The alert might be resolved now, but in a 
minute or two it will reoccur because the metric went over the threshold 
again.

I have these grouping values set:

group_interval: 1s
group_wait: 1s
group_by: ['...']

Is there anything you could advise to improve this situation?
I am kind of lost here, and I haven't found this functionality in either 
Prometheus or Alertmanager.
