Hello,

One of the biggest challenges we face when running Prometheus with a 
constantly growing number of scraped services is keeping resource usage 
under control, which in practice usually means memory usage.
Cardinality is often a huge problem, and we regularly end up with 
services accidentally exposing risky labels. One silly mistake we see 
every now and then is putting raw error strings into labels, which leads 
to time series like {error="connection from $ip:$port to $ip:$port timed 
out"} and so on.
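To illustrate, this is the kind of instrumentation that causes it (a 
hedged Go sketch using client_golang; the metric and function names are 
made up):

    package example

    import "github.com/prometheus/client_golang/prometheus"

    // Hypothetical counter with a raw error string as a label value.
    var rpcErrors = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myservice_rpc_errors_total",
            Help: "RPC errors seen by myservice.",
        },
        []string{"error"},
    )

    func recordError(err error) {
        // Every distinct error string (raw IPs, ports, timeouts...)
        // becomes its own time series - unbounded cardinality.
        rpcErrors.WithLabelValues(err.Error()).Inc()
    }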

We've tried a number of ways of dealing with this using vanilla 
Prometheus features, but none of them really works well for us.
Obviously there is sample_limit, which one might use here, but its 
biggest problem is that once you hit the sample_limit threshold you lose 
all metrics from that target, and that's just not acceptable for us.
If I have a service that exports 999 time series and it suddenly goes to 
1001 (with sample_limit=1000), I really don't want to lose all of its 
metrics, because losing all monitoring is a bigger problem than having a 
few extra time series in Prometheus. It's just too risky.
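For clarity, this is roughly how the stock behaviour plays out (a 
simplified Go sketch of my own, not the actual Prometheus scrape code):

    package example

    import "errors"

    // Simplified sketch: crossing sample_limit fails the whole scrape,
    // so every sample from that target is dropped for that cycle.
    func checkSampleLimit(scrapedSamples, sampleLimit int) error {
        if sampleLimit > 0 && scrapedSamples > sampleLimit {
            // The target is reported as failed and *all* of its
            // metrics disappear, not just the ones over the limit.
            return errors.New("sample limit exceeded")
        }
        return nil
    }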

We're currently running Prometheus with patches from:
https://github.com/prometheus/prometheus/pull/11124

This gives us two levels of protection:
- a global HEAD limit - Prometheus is not allowed to have more than M 
time series in TSDB
- a per-scrape sample_limit - patched so that once you exceed 
sample_limit it starts rejecting time series that aren't already in TSDB 
(sketched below)
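Here's a rough Go sketch of that second level (my own simplification of 
the idea, not the code from the PR):

    package example

    // Over the limit we keep appending to series TSDB already knows
    // about and only reject the creation of new ones.
    func rejectSeries(seriesSeen, sampleLimit int, inTSDB bool) bool {
        if sampleLimit <= 0 || seriesSeen <= sampleLimit {
            return false // under the limit: accept everything
        }
        // Existing metrics keep flowing; new time series (e.g. from
        // a typo or a bad label) are refused.
        return !inTSDB
    }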

This works well for us and gives us a system where:
- we have reassurance that Prometheus won't start getting OOM-killed 
overnight
- service owners can add new metrics without fear that a typo will cost 
them all of their metrics

But comments on that PR suggest that it's a highly controversial feature.
I wanted to probe this community to see what the overall feeling is and 
how likely it is that vanilla Prometheus will get something like this.
It's a small patch, so I'm happy to keep maintaining it for our internal 
deployments, but it feels like a common problem to me, so a baked-in 
solution would be great.

Lukasz
