> I'm working with a metric like CPU usage, where instance identifiers
> are submitted as labels. To ensure instances are running as expected,
> I've defined an alert based on this metric. The alert triggers when
> the aggregation value (in my case, the increase) over a time window
> falls below an expected threshold. By utilizing the instance
> identifier as a label, I've streamlined the alert definition to one.
>
> So far, I've been successful in achieving this. However, I'm grappling
> with how to handle instances that have been intentionally shut down.
> Since the metric value for these instances remains static, the alert
> consistently fires.

I think it may depend on how you're collecting these metrics. In
general, the best way to collect per-instance metrics is to have
Prometheus directly scrape them from a target that will stop responding
or go away if the instance does. When a scrape fails, Prometheus
immediately marks all metrics it supplies as stale, and I believe that
this also happens when a scrape target is removed (for example through
service discovery no longer listing it). Metrics that are known to be
stale no longer show up in rate() and other query functions, so
normally they automatically stop triggering such alerts (and any
active alert for such a target will resolve when the target is
removed).
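
As a concrete sketch (all metric names, thresholds, and label values
here are made up for illustration, not taken from your setup), a
directly-scraped per-instance alert of this shape relies on that
staleness handling to stop firing once a target disappears:

```yaml
# Hypothetical alert rule; 'cpu_seconds_total' and the thresholds are
# illustrative. Because Prometheus scrapes each instance directly,
# series from a dead or removed target go stale, drop out of
# increase(), and the alert resolves on its own.
groups:
  - name: instance-activity
    rules:
      - alert: InstanceCPUActivityLow
        expr: increase(cpu_seconds_total[15m]) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU activity on {{ $labels.instance }} is below the expected threshold"
```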

If you're collecting these metrics in a way that leaves them stuck
after the instance goes away (the classic case is publishing them
through Pushgateway), then either you need an additional 'is this
instance alive' check in your alerts, or you need some additional
system to delete metrics for now-removed instances from wherever
they're being published. If you have control over the complete set of
metrics you're publishing, one option is to publish a last-updated
metric and then only alert if that metric is recent enough. In many
cases you can arrange for this metric to have the same labels as your
other metrics, so you can just add something like 'and ((time() -
metric) < 120)' to your alert rule. If the labels are different,
you'll need to get more creative.
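
To sketch what that looks like (again with invented names): if the
publisher also exposes something like 'cpu_last_updated_seconds', set
to the Unix time of the last genuine update and carrying the same
labels as the CPU metric, the alert rule becomes:

```yaml
# Hypothetical names. 'cpu_last_updated_seconds' is assumed to carry
# exactly the same labels as the CPU metric, so the 'and' matches
# series one-to-one. The 120s freshness window must be longer than
# the publishing interval, or the alert can never fire.
- alert: InstanceCPUActivityLow
  expr: |
    increase(cpu_seconds_total[15m]) < 10
    and (time() - cpu_last_updated_seconds) < 120
  for: 5m
```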

(Conveniently, Pushgateway already provides such a metric for each
group, in 'push_time_seconds'. However, it may not have the same
labels as the metric you're alerting on.)
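
When the labels differ like that, one option is an explicit matching
clause on the labels both sides share. Assuming (my assumption, this
depends on your Pushgateway grouping) that the group key includes
'job' and 'instance', it might look like:

```yaml
# Hypothetical sketch: push_time_seconds carries only the group's
# labels, so 'on (job, instance)' restricts the 'and' to the labels
# both sides have in common.
- alert: InstanceCPUActivityLow
  expr: |
    increase(cpu_seconds_total[15m]) < 10
    and on (job, instance)
    (time() - push_time_seconds) < 120
```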

        - cks

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3271539.1701873957%40apps0.cs.toronto.edu.
