Thank you! I will try this out with a newer version and experiment with hashmod.
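For anyone following up on the hashmod suggestion, here is a sketch of the mechanism. As we understand the Prometheus source, the `hashmod` relabel action joins the `source_labels` values with the separator (default `;`) and takes the MD5 digest. It reads the digest's last 8 bytes as a big-endian unsigned integer and reduces that modulo `modulus`. Each shard then keeps only the targets whose bucket matches its own index. The Python sketch reproduces that logic outside Prometheus, so you can check how evenly a set of targets would spread across shards. The function names, example addresses and shard count are hypothetical.

```python
import hashlib
from typing import List, Sequence


def hashmod(label_values: Sequence[str], modulus: int, separator: str = ";") -> int:
    """Bucket index for a target, mirroring (as we understand it) Prometheus's
    `hashmod` relabel action: MD5 of the joined source label values, last
    8 bytes read as a big-endian unsigned integer, reduced modulo `modulus`."""
    joined = separator.join(label_values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses: Sequence[str], shard: int, shards: int) -> List[str]:
    """Targets one shard would keep: the equivalent of a hashmod rule on
    __address__ followed by a `keep` rule matching this shard's index."""
    return [a for a in addresses if hashmod([a], shards) == shard]


if __name__ == "__main__":
    # Hypothetical pod addresses standing in for the __address__ label.
    pods = [f"10.0.{i // 256}.{i % 256}:8080" for i in range(1000)]
    shards = 4
    for shard in range(shards):
        kept = targets_for_shard(pods, shard, shards)
        print(f"shard {shard}: {len(kept)} targets")
```

In a real deployment, each Prometheus shard gets two `relabel_configs` entries. The first is a `hashmod` rule that writes the bucket into a temporary label. The second is a `keep` rule that matches that shard's index. The Prometheus Operator issue linked in Ben's reply discusses automating this. Because every target hashes to exactly one bucket, the shards partition the targets without overlap.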
On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie wrote:

> Thanks, knowing what Prometheus version you're on helps a lot. There are two things that will help setups like yours quite a lot.
>
> First, Prometheus 2.19 introduced some new memory management improvements that mostly eliminate pod churn memory growth. It also greatly improves memory use for high scrape frequencies.
>
> Second, 2.18.2 was the first official Prometheus version to be built with Go 1.14. This introduced an issue that affected compression, and hence the memory use, of Prometheus. See https://github.com/prometheus/prometheus/pull/7976.
>
> Once 2.22.0 is out, upgrading would be highly recommended.
>
> You might want to look at this Prometheus Operator issue about hashmod sharding:
> https://github.com/prometheus-operator/prometheus-operator/issues/2590
>
> On Sun, Oct 11, 2020 at 10:14 PM kvr wrote:
>
>> There are different services, and each could scale to 1000+ pods in a given namespace. But even then, managing a Prometheus instance pair per set of apps is not tenable. The management overhead would be too great when there are several such apps.
>>
>> Version-wise, we are keeping up, but not aggressively. We are on 2.18.2, and the instance under test does not have Thanos. It only scrapes and does some rule evaluation (the memory usage is the same even when rule evaluation is disabled). We are using the Prometheus Operator to reload config.
>>
>> Yeah, I read that ~2GB of memory is sufficient per million metrics, so I am surprised that it consumes such a large amount. Could having diverse scrape intervals have such an effect?
>>
>> Our stats at peak:
>> ~15M head series
>> ~45M head chunks
>> ~475K samples/s ingested
>> ~7000 pods scraped
>>
>> Thanks!
>>
>> On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected] wrote:
>>
>>> If all of the 1000s of pods in a namespace are running the same thing, you can use the hashmod feature to scale horizontally.
>>>
>>> You can have several Prometheus instances per namespace, each responsible for a fraction of the pods.
>>>
>>> Just to be sure, are you keeping up to date on the latest releases? 200GB of memory seems like a lot for 15M series.
>>>
>>> Are you using Thanos or a remote write service?
>>>
>>> On Sun, Oct 11, 2020, 07:14 kvr wrote:
>>>
>>>> Hello,
>>>>
>>>> We are hitting some limits with our current Prometheus setup. I have gone through a lot of posts here, as well as blogs and videos, but still need some guidance.
>>>>
>>>> Our current setup is at its limit. The head series count regularly reaches around 15M during pod churn. Each app exports between 5000 and 8000 metric series, so 1000 pods cause about 8M new series in the head block. Prometheus currently has access to 300GB of memory, but in practice it can't use more than 200GB. It starts degrading around the 150GB mark.
>>>> - Scrape time for Prometheus scraping itself is 5+ seconds, and config reloads fail.
>>>> - We verified that this is not due to a cardinality explosion from a misbehaving app, so the degradation is a natural consequence of load.
>>>> - We ruled out bad queries as a cause by spinning up an additional Prometheus that only scrapes targets and does nothing else. So the bottleneck is purely ingestion.
>>>>
>>>> So the next step for us is to shard and use namespace-level Prometheis. But I expect a similar level of usage again at the namespace level in about a year, with multiple apps in a single namespace scaling to 1000s of pods exporting 5K metrics each. And I will not be able to shard again, because I don't want to go below namespace granularity.
>>>>
>>>> How have others dealt with this situation, where the bottleneck is going to be ingestion and not queries?
>>>>
>>>> Thanks for your time,
>>>> KVR
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com.
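The figures in this thread can be checked with a little arithmetic. This sketch multiplies pods by series per pod, using the upper figure of 8000 series per pod from the original message. It then applies the ~2GB-per-million rule of thumb quoted in the thread. That rule is only an approximation, and the helper names are hypothetical.

```python
# Rule of thumb quoted in the thread (approximate, not a guarantee):
# roughly 2 GB of memory per million active head series.
GB_PER_MILLION_SERIES = 2.0


def new_series(pods: int, series_per_pod: int) -> int:
    """Head series created when `pods` fresh pods each export `series_per_pod` series."""
    return pods * series_per_pod


def estimated_memory_gb(head_series: int) -> float:
    """Memory estimate derived from the rule of thumb above."""
    return head_series / 1_000_000 * GB_PER_MILLION_SERIES


if __name__ == "__main__":
    churn = new_series(pods=1000, series_per_pod=8000)
    print(churn)  # 8000000

    peak_head_series = 15_000_000  # "~15M head series" from the peak stats
    estimate = estimated_memory_gb(peak_head_series)
    print(estimate)  # 30.0

    observed_gb = 200  # upper memory figure reported in the thread
    print(f"observed / estimate = {observed_gb / estimate:.1f}x")  # observed / estimate = 6.7x
```

At ~15M head series, the rule of thumb suggests roughly 30GB, several times less than the 150 to 200GB observed. That gap is why the replies pointed at version-specific memory issues (pre-2.19 churn handling, the Go 1.14 compression issue in 2.18.2) rather than at the raw series count alone.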

