Hello,

We are hitting some limits with our current Prometheus setup. I have read many posts here, as well as blogs and videos, but still need some guidance.
Our current setup is at its limit:

- Head series count regularly reaches ~15M during pod churn. Each app exports between 5,000 and 8,000 series, so 1,000 pods create roughly 8M new series in the head block.
- Prometheus has access to 300 GB of memory, but in practice it cannot use more than 200 GB, and it starts degrading around the 150 GB mark.
- Scrape time for Prometheus scraping itself is 5+ seconds, and config reloads fail.
- We verified that this is not a cardinality explosion from a misbehaving app; the setup has degraded naturally under load.
- We eliminated bad queries as a cause by spinning up an additional Prometheus that only scrapes targets and does nothing else. So the bottleneck is ingestion alone.

The next step for us is to shard into namespace-level Prometheus instances. But I expect a similar level of usage again in about a year at the namespace level, with multiple apps in a single namespace scaling to thousands of pods exporting 5K series each, and I will not be able to shard again because I don't want to go below namespace granularity.

How have others dealt with this situation, where the bottleneck is going to be ingestion and not queries?

Thanks for your time,
KVR
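P.S. For concreteness, the namespace-level sharding we are considering can be expressed purely in relabel rules, so each instance only keeps its assigned namespaces. A minimal sketch, assuming pod service discovery (the job name and namespace list are placeholders):

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only targets from the namespaces assigned to this shard;
      # each Prometheus instance gets a different regex.
      - source_labels: [__meta_kubernetes_namespace]
        regex: "team-a|team-b"
        action: keep
```

Each shard drops non-matching targets at discovery time, so ingestion load is split before any samples are scraped.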

