[prometheus-users] Re: Expire silence while alert became resolved

[email protected] Fri, 27 Nov 2020 03:03:56 -0800

During maintenance, alerts tend to bounce up and down, so for scheduled 
maintenance I wouldn't want a resolved alert to clear the silence 
automatically.  Rather, there should be a process to remove the silences 
when the maintenance has been confirmed as complete (by the engineers).  
For example, you can include the ticket number in the silence (or the 
silence ID in the ticket), and delete it when the ticket is closed.


I do sympathise with the use case: "X has gone down, a ticket has been 
raised for X, don't bug me about X again".  This might be done by setting a 
long-duration silence, say 2 weeks - but if the problem is fixed before 
then, you do want to start raising alarms again.  Here too, if the silence 
is explicitly linked to a ticket, then closing the ticket can delete the 
silence.
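
As a rough sketch of that ticket linkage, assuming the ticket number was 
written into the silence comment when it was created: a hook in the 
ticketing system could look up the matching silences and expire them.  The 
function names and the "OPS-12" ticket format are made up for illustration; 
the field names and the DELETE endpoint are those of the Alertmanager v2 
API as I understand it, so check them against your version.

```python
import re
import urllib.parse
import urllib.request


def silences_for_ticket(silences, ticket_id):
    """Return IDs of non-expired silences whose comment mentions ticket_id.

    `silences` is the parsed JSON list from GET /api/v2/silences.
    The lookarounds stop "OPS-12" from also matching "OPS-123".
    """
    pattern = re.compile(rf"(?<![\w-]){re.escape(ticket_id)}(?![\w-])")
    return [
        s["id"]
        for s in silences
        if s["status"]["state"] != "expired"
        and pattern.search(s.get("comment", ""))
    ]


def expire_silence(base_url, silence_id):
    """Expire one silence via DELETE /api/v2/silence/{id} (assumed endpoint)."""
    url = f"{base_url}/api/v2/silence/{urllib.parse.quote(silence_id)}"
    request = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

The ticket-close hook would fetch the silence list, call 
silences_for_ticket() with the closed ticket's number, and pass each 
returned ID to expire_silence().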

If you don't have this ticket linkage, then it might be useful to have an 
external program which monitors Alertmanager and notices when a given 
silence has been covering zero alerts for an extended period of time (say 6 
hours), and then either flags it for attention or expires it automatically.
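
A minimal sketch of the core of such a monitor, assuming it polls 
GET /api/v2/silences and GET /api/v2/alerts periodically and keeps a small 
dict of state between polls.  The function name and the state dict are 
hypothetical; the code relies on each alert listing the silences that cover 
it under status.silencedBy, which is how I understand the v2 API to behave.

```python
from datetime import datetime, timedelta, timezone

IDLE_THRESHOLD = timedelta(hours=6)


def idle_silences(silences, alerts, idle_since, now, threshold=IDLE_THRESHOLD):
    """Return IDs of active silences that have covered no alerts for `threshold`.

    silences:   parsed JSON from GET /api/v2/silences
    alerts:     parsed JSON from GET /api/v2/alerts (silenced alerts included)
    idle_since: dict of silence ID -> time it was first seen covering nothing;
                updated in place and carried over between polls
    """
    in_use = {sid for a in alerts for sid in a["status"].get("silencedBy", [])}
    active = {s["id"] for s in silences if s["status"]["state"] == "active"}

    # Forget silences that have gone away or are covering alerts again.
    for sid in list(idle_since):
        if sid not in active or sid in in_use:
            del idle_since[sid]

    stale = []
    for sid in sorted(active - in_use):
        first_idle = idle_since.setdefault(sid, now)
        if now - first_idle >= threshold:
            stale.append(sid)
    return stale
```

Calling this once per poll with now=datetime.now(timezone.utc) gives the 
list of silences to flag or expire.  Because the idle clock restarts 
whenever a silence covers an alert again, a flapping alert during 
maintenance keeps its silence alive.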

Remember that depending on how it's created, one silence can cover multiple 
active alerts.  If it covers several alerts, and a subset have resolved, 
you couldn't just delete the silence.  You'd have to replace it with a more 
specific matching silence to cover the remaining alerts which are still 
active.
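
One way this narrowing could be sketched, assuming the remaining alerts can 
be told apart by a single label (here "instance", which is only a guess at 
what distinguishes them in your setup): build one replacement silence per 
remaining label value, keeping the original matchers, end time and comment.  
The function name is hypothetical, and the payload fields follow the 
Alertmanager v2 silence format as I understand it.

```python
def narrowed_silences(silence, remaining_alerts, starts_at, label="instance"):
    """Build payloads for POST /api/v2/silences covering only remaining alerts.

    silence:          the existing silence, as returned by the API
    remaining_alerts: the still-active alerts it covers (API alert objects)
    starts_at:        RFC 3339 timestamp for the new silences, e.g. "now"
    Alerts without `label` are skipped, so they would lose their silence.
    """
    payloads = []
    seen = set()
    for alert in remaining_alerts:
        value = alert["labels"].get(label)
        if value is None or value in seen:
            continue
        seen.add(value)
        # Keep the original matchers, but pin `label` to this exact value.
        matchers = [m for m in silence["matchers"] if m["name"] != label]
        matchers.append({"name": label, "value": value, "isRegex": False})
        payloads.append({
            "matchers": matchers,
            "startsAt": starts_at,
            "endsAt": silence["endsAt"],
            "createdBy": silence["createdBy"],
            "comment": f"{silence['comment']} (narrowed from {silence['id']})",
        })
    return payloads
```

The new silences should be created before the broad one is expired, so that 
the remaining alerts are never left uncovered in between.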

