Hello,

We are hitting some limits with our current setup of Prometheus. I have 
read a lot of posts here as well as blogs and videos but still need some 
guidance.

Our current setup is at its limit. Head series count regularly reaches around 
15M during pod churn. Each app exports between 5,000 and 8,000 metric series, 
so 1,000 pods cause roughly 8M new series in the head block. 
Prometheus currently has access to 300 GB of memory, but in practice it 
cannot make use of more than 200 GB, and it starts degrading around the 
150 GB mark. 
- Scrape time for Prometheus scraping itself exceeds 5 seconds, and config 
reloads fail.
- We verified that this is not a cardinality explosion from a misbehaving 
app; the setup has degraded naturally under load.
- We eliminated bad queries as a cause by spinning up an additional 
Prometheus that only scrapes targets and serves no queries. So the 
bottleneck is ingestion itself. 
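For reference, the head-series and churn numbers above come from Prometheus's own TSDB metrics; a minimal set of queries we watch (lookback windows are just what we happen to use):

```
# Current number of series in the head block
prometheus_tsdb_head_series

# Series churn: rate of new series being created in the head
rate(prometheus_tsdb_head_series_created_total[5m])

# Overall ingestion rate in samples per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
```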

So the next step for us is to shard into namespace-level Prometheus servers 
(one per namespace). But I expect a similar level of usage again in about a 
year at the namespace level, with multiple apps in a single namespace 
scaling to thousands of pods exporting 5K series each. And I will not be 
able to shard again, because I don't want to go below namespace granularity. 
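To make the plan concrete, this is roughly what we mean by a namespace-level shard: a sketch of a scrape config assuming the standard kubernetes_sd pod role, where the namespace name and the scrape annotation convention are placeholders for our actual setup.

```yaml
# Sketch: one Prometheus shard scraping only its assigned namespace.
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - team-a   # placeholder: this shard's namespace
    relabel_configs:
      # Placeholder convention: only scrape pods opting in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```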

How have others dealt with this situation, where the bottleneck is going to 
be ingestion rather than queries?

Thanks for your time,
KVR

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com.
