On 28.04.21 23:55, [email protected] wrote:
> Hi,
> I have a batch process that pushes metrics to the Pushgateway.
> The batch process runs every day for ~2 minutes.
> I also have alerts on the metrics, and it seems to work fine except in the
> following scenario.
> 
> When an alert fires, there is nothing that "clears" the alert in
> Prometheus, and the Alertmanager keeps sending notifications. The
> alert can be cleared only after 24 hours, when the batch process is
> triggered again.
> 
> I was hoping that "resolve_timeout: 5m" would solve this, but it does not.
> Any idea how to deal with such a scenario?

Most alerts (or you could say: well-designed alerts) fire for as long
as the alerting condition still applies. In your case, it sounds like
the alert will fire for a day if the daily job fails.
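To illustrate why the alert stays active between runs, here is a minimal
sketch in Python. It is not Prometheus code, only a toy model. The metric
name `job_success`, the one-minute evaluation interval, and the condition
`job_success == 0` are assumptions made for illustration. The Pushgateway
keeps exposing the last pushed value until it is overwritten or deleted,
so the condition keeps holding until the next run pushes a new value.

```python
def firing_minutes(pushes, horizon_minutes):
    """Return the evaluation times (in minutes) at which the alert fires.

    pushes: dict mapping minute -> pushed value of a hypothetical
            `job_success` gauge (1 = success, 0 = failure).
    The rule is evaluated once per minute (an assumed interval).
    """
    firing = []
    last_value = None  # nothing pushed yet
    for t in range(horizon_minutes):
        if t in pushes:
            # A new push overwrites the previously stored value.
            last_value = pushes[t]
        # Assumed alert condition: job_success == 0
        if last_value == 0:
            firing.append(t)
    return firing


DAY = 24 * 60
# Failed run at minute 0, successful run one day later.
result = firing_minutes({0: 0, DAY: 1}, horizon_minutes=DAY + 10)
print(len(result), result[0], result[-1])  # prints: 1440 0 1439
```

In this model the alert fires at every evaluation for the whole day and
stops only once the successful run overwrites the stored value.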

What happens if you trigger an immediate re-run? Would it clear the
alert if that re-run succeeds?

If you can really only run the job once per day, I'd say it is "the
right thing to do" to keep the alert firing until a run has finally
succeeded.

The usual workflow for an operator to say "I have seen the alert, and
now I'm working on it, but I'm aware that it is still firing" is to
place a silence for the expected time you need to fix the issue.
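
As a rough sketch of what such a silence could look like, the Python
snippet below builds a request body in the shape the Alertmanager v2 API
accepts for silences. The alert name `BatchJobFailed`, the author, and the
comment are hypothetical placeholders, and the helper `build_silence` is
not part of any library. The snippet only constructs the payload and does
not send it anywhere.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, author, comment, now=None):
    """Build a silence body matching one alert name for `hours` hours."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Match alerts whose `alertname` label equals the given value exactly.
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        # Timezone-aware ISO 8601 timestamps.
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical values, for illustration only.
fixed_now = datetime(2021, 5, 4, 15, 20, tzinfo=timezone.utc)
payload = build_silence(
    "BatchJobFailed", 4, "operator", "Investigating failed daily run",
    now=fixed_now,
)
print(json.dumps(payload, indent=2))
```

A body like this could be sent with a POST to the `/api/v2/silences`
endpoint of your Alertmanager. The same silence can also be created in
the Alertmanager web UI or with `amtool`.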

-- 
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] [email protected]
