> I'm working with a metric like CPU usage, where instance identifiers
> are submitted as labels. To ensure instances are running as expected,
> I've defined an alert based on this metric. The alert triggers when
> the aggregation value (in my case, the increase) over a time window
> falls below an expected threshold. By utilizing the instance
> identifier as a label, I've streamlined the alert definition to one.
>
> So far, I've been successful in achieving this. However, I'm grappling
> with how to handle instances that have been intentionally shut down.
> Since the metric value for these instances remains static, the alert
> consistently fires.
I think it may depend on how you're collecting these metrics. In general, the best way to collect per-instance metrics is to have Prometheus directly scrape them from a target that will stop responding or go away when the instance does. When a scrape fails, Prometheus immediately marks all the metrics it supplied as stale, and I believe this also happens when a scrape target is removed (for example, through service discovery no longer listing it). Metrics that are known to be stale no longer show up in rate() and similar functions, so normally they automatically stop triggering such alerts (and any active alert for such a target will go away when it's removed).

If you're collecting these metrics in a way that leaves them stuck after the instance goes away (the classical case is publishing them through Pushgateway), then either you need an additional 'is this instance alive' check in your alerts, or you need some additional system to delete the metrics of now-removed instances from wherever they're being published.

If you have control over the complete set of metrics you're publishing, one option is to publish a last-updated metric and then only alert if that metric is recent enough. In many cases you can arrange for this metric to have the same labels as your other metrics, so you can just add something like 'and ((time() - metric) < 120)' to your alert rule. If the labels are different, you'll need to get more creative.

(Conveniently, Pushgateway already provides such a metric for each group, in 'push_time_seconds'. However, it may not have all the same labels as the metric you're alerting on.)

- cks

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3271539.1701873957%40apps0.cs.toronto.edu.
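As a concrete sketch of the 'and' freshness guard described above: this is a hypothetical alerting rule, where the counter `work_items_total`, the companion gauge `work_last_updated_seconds`, and all thresholds are made-up assumptions standing in for whatever your actual pushed metrics and limits are. It only fires the low-activity alert for instances whose last-updated metric has been refreshed in the past two minutes, so intentionally shut-down instances drop out of the alert once their pushes stop being refreshed.

```yaml
groups:
  - name: instance-activity
    rules:
      - alert: InstanceActivityLow
        # Assumed metrics: work_items_total (counter) and
        # work_last_updated_seconds (gauge set to the push time),
        # published with the same instance-identifier labels so the
        # 'and' matches series one-to-one.
        expr: |
          increase(work_items_total[10m]) < 50
          and (time() - work_last_updated_seconds) < 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance_id }} is doing less work than expected"
```

If you're pushing through Pushgateway and don't control the published metrics, you could try substituting `push_time_seconds` for the custom gauge, but then the 'and' may need an `on (...)` matching clause because its label set usually differs from your metric's.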