On 28.04.21 23:55, [email protected] wrote:
> hi
> I have some batch process that push metrics to push-gateway.
> the batch process runs every-day for ~2 minutes.
> i also have alerts on the metrics, and it seems to work fine except to the
> following scenario.
>
> when alert is fired, there is nothing that "clears" the alert from
> Prometheus, and the alert manager keeps sending the notifications. the
> alert can be cleared only after 24 hours, when the batch process is
> triggered again.
>
> i was hoping that "resolve_timeout: 5m" will solve this, but it's not.
> any idea how deal with such senario?

Most alerts (or you could say: well-designed alerts) fire for as long
as the alerting condition still applies. In your case, it sounds like
if the daily job fails, the alert will fire for a day.

What happens if you trigger an immediate re-run? Would it clear the
alert if that re-run succeeds?

If you can really only run the job once per day, I'd say it is "the
right thing to do" to keep the alert firing until a run has finally
succeeded.

The usual workflow for an operator to say "I have seen the alert, and
now I'm working on it, but I'm aware that it is still firing" is to
place a silence for the time you expect to need to fix the issue.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] [email protected]
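
For illustration, the silence workflow described in the reply above
could be scripted roughly as follows. This is only a sketch, not
something from the thread: it assumes Alertmanager's v2 HTTP API
(POST /api/v2/silences) is reachable, and the alert name
"DailyBatchFailed", the helper names build_silence and post_silence,
and the base URL are all hypothetical placeholders.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a request body for Alertmanager's POST /api/v2/silences.

    The silence matches alerts by their `alertname` label and lasts
    for `hours` hours, starting at `now` (default: current UTC time).
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the alert name.
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, silence):
    """Send the silence to Alertmanager and return the parsed JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical alert name; nothing is sent until post_silence is called.
    body = build_silence("DailyBatchFailed", 4, "operator", "working on it")
    print(json.dumps(body, indent=2))
    # post_silence("http://localhost:9093", body)  # assumed Alertmanager URL
```

The alert keeps firing in Prometheus while the silence is active; only
the notifications are suppressed. Once the silence expires, they resume
unless a successful run has cleared the alert in the meantime.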

