Hello,

One of the biggest challenges we face when running Prometheus against a constantly growing number of scraped services is keeping resource usage, which usually means memory usage, under control. Cardinality is a huge problem, and we regularly end up with services accidentally exposing risky labels. One silly mistake we see every now and then is putting raw error messages into labels, which then leads to time series like {error="connection from $ip:$port to $ip:$port timed out"} and so on.
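One way we nudge service owners toward bounded label values is a small sanitiser along these lines (an illustrative Python sketch; the patterns and category names are mine, not from any shared library):

```python
import re

# Hypothetical helper: collapse unbounded error strings into a small,
# fixed set of label values before using them as Prometheus labels.
# The patterns and category names below are illustrative assumptions.
_PATTERNS = [
    (re.compile(r"timed out"), "timeout"),
    (re.compile(r"connection refused"), "connection_refused"),
    (re.compile(r"no such host|name resolution"), "dns_error"),
]

def error_category(err: str) -> str:
    """Map a raw error message to a bounded label value."""
    for pattern, category in _PATTERNS:
        if pattern.search(err):
            return category
    return "other"  # fallback keeps cardinality bounded

# e.g. requests_total.labels(error=error_category(str(exc))).inc()
```

The point is that the label value set is fixed up front, so a new flavour of error message can never mint a new time series.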
We have tried a number of ways of dealing with this using vanilla Prometheus features, but none of them really works well for us. Obviously there is sample_limit, which one might use here, but its biggest problem is that once you hit the sample_limit threshold you lose all metrics, and that's just not acceptable for us. If I have a service that exports 999 time series and it suddenly goes to 1001 (with sample_limit=1000), I really don't want to lose all metrics just because of that: losing all monitoring is a bigger problem than having a few extra time series in Prometheus. It's just too risky.

We're currently running Prometheus with patches from https://github.com/prometheus/prometheus/pull/11124. This gives us two levels of protection:
- a global HEAD limit - Prometheus is not allowed to have more than M time series in the TSDB
- per-scrape sample_limit - but patched so that once you exceed sample_limit it starts rejecting only time series that aren't already in the TSDB

This works well for us and gives us a system that:
- reassures us that Prometheus won't start getting OOM killed overnight
- lets service owners add new metrics without fear that a typo will cost them all their metrics

But comments on that PR suggest that it's a highly controversial feature. I wanted to probe this community to see what the overall feeling is and how likely it is that vanilla Prometheus will get something like this. It's a small patch, so I'm happy to just maintain it for our internal deployments, but it feels like a common problem to me, so a baked-in solution would be great.

Lukasz
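P.S. For anyone curious, the patched per-scrape semantics boil down to something like this (an illustrative Python sketch, not the actual Go patch; the function name and signature are mine):

```python
def admit_samples(scraped_series, known_series, sample_limit):
    """Sketch of the 'soft' sample_limit behaviour from
    prometheus/prometheus#11124: series already present in the TSDB
    always pass, and only *new* series are rejected once the limit
    is reached. Names and signature are assumptions, not the patch API."""
    accepted = []
    for series in scraped_series:
        if series in known_series or len(accepted) < sample_limit:
            accepted.append(series)
        # else: the series is new and the limit is hit, so it is
        # dropped, while already-known series keep flowing
    return accepted
```

So a target drifting slightly over its limit only loses the newest series instead of all of its monitoring.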