Hi Karthik,

Another option is to replace Prometheus with the VictoriaMetrics stack <https://victoriametrics.github.io/>, which includes vmagent <https://victoriametrics.github.io/vmagent.html> for scraping and vmalert <https://victoriametrics.github.io/vmalert.html> for alerting and recording rules. It is optimized for high load, so it should need fewer resources than Prometheus for the same workload. See, for example, this case study <https://victoriametrics.github.io/CaseStudies.html#wixcom>.
On Tue, Oct 13, 2020 at 2:23 PM Karthik Vijayaraju <[email protected]> wrote:

> Thank you!
> I will try this out with a newer version and experiment with hashmod.
>
> On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <[email protected]> wrote:
>
>> Thanks, knowing what Prometheus version you're on helps a lot. There are
>> two things that will help setups like yours quite a lot.
>>
>> First, Prometheus 2.19 introduced some new memory management improvements
>> that mostly eliminate pod churn memory growth. It also greatly improves
>> memory use for high scrape frequencies.
>>
>> Second, 2.18.2 was the first official Prometheus version to be built with
>> Go 1.14. This introduced an issue that affected the compression, and
>> hence the memory use, of Prometheus. See
>> https://github.com/prometheus/prometheus/pull/7976.
>>
>> Once 2.22.0 is out, upgrading would be highly recommended.
>>
>> You might want to look at this Prometheus Operator issue about hashmod
>> sharding:
>> https://github.com/prometheus-operator/prometheus-operator/issues/2590
>>
>> On Sun, Oct 11, 2020 at 10:14 PM kvr <[email protected]> wrote:
>>
>>> There are different services and each could scale to 1000+ pods in a
>>> given namespace.
>>> But even then, managing a Prometheus instance pair per set of apps is
>>> not tenable. The management overhead would be too great when there are
>>> several such apps.
>>>
>>> Version-wise, we are keeping up, but not aggressively.
>>> We are on 2.18.2 and the instance under test does not have Thanos. It
>>> only scrapes and does some rule evaluation (the memory usage is the same
>>> even when rule eval is disabled).
>>> We are using Prometheus Operator to reload config.
>>>
>>> Yeah, I read that ~2GB of memory is sufficient per million metrics, so I
>>> am surprised that it consumes such a large amount. Would having diverse
>>> scrape intervals have such an effect?
>>>
>>> Our stats at peak:
>>> ~15M head series
>>> ~45M head chunks
>>> ~475K samples/s ingested
>>> ~7000 pods scraped
>>>
>>> Thanks!
>>>
>>> On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected]
>>> wrote:
>>>
>>>> If all of the 1000s of pods in a namespace are of the same thing, you
>>>> can use the hashmod feature to horizontally scale.
>>>>
>>>> You can have several Prometheus instances per namespace, each
>>>> responsible for a fraction of the pods.
>>>>
>>>> Just to be sure, are you keeping up to date on the latest releases?
>>>> 200G of memory seems like a lot for 15M series.
>>>>
>>>> Are you using Thanos or a remote write service?
>>>>
>>>> On Sun, Oct 11, 2020, 07:14 kvr <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are hitting some limits with our current setup of Prometheus. I
>>>>> have read a lot of posts here as well as blogs and videos but still
>>>>> need some guidance.
>>>>>
>>>>> Our current setup is at its limit. Head series count is regularly
>>>>> around 15M during pod churn. Each app exports between 5000 and 8000
>>>>> metric series, so 1000 pods cause about 8M new series in the head
>>>>> block.
>>>>> Prometheus currently has access to 300 GB of memory, but it can't use
>>>>> past 200GB in reality. It starts degrading around the 150GB mark.
>>>>> - Scrape time for Prometheus scraping itself is 5+ seconds and config
>>>>> reloads fail.
>>>>> - We verified that this is not due to a cardinality explosion from a
>>>>> misbehaving app. So this has naturally degraded due to load.
>>>>> - We eliminated bad queries as a cause by spinning up an additional
>>>>> Prometheus which just scrapes targets and nothing else. So the
>>>>> bottleneck is just ingestion.
>>>>>
>>>>> So the next step for us is to shard and use namespace-level
>>>>> Prometheis.
>>>>> But I expect a similar level of usage in about a year again at
>>>>> the namespace level, with multiple apps in a single namespace scaling
>>>>> to 1000s of pods exporting 5K metrics each. And I will not be able to
>>>>> shard again because I don't want to go below the NS granularity.
>>>>>
>>>>> How have others dealt with this situation, where the bottleneck is
>>>>> going to be ingestion and not queries?
>>>>>
>>>>> Thanks for your time,
>>>>> KVR
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .

--
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAPbKnmB%3DpNwhMPqKCMR28%2B4LEJw4002Ev7pXaHx%3DsavD5Fs9xw%40mail.gmail.com.
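
For anyone following the hashmod suggestion in the quoted thread, here is a rough Python sketch of the idea: hash each target address, take the result modulo the number of Prometheus instances, and let each instance keep only the targets whose remainder matches its own shard number. In Prometheus itself this is done with the `hashmod` relabel action followed by a `keep` rule; the code below is only an illustration, not Prometheus code. The pod addresses, the shard count of 4, and the helper names `shard_for` and `targets_for_shard` are all made up for the example, and the exact way the hash is reduced to an integer is an assumption, so the shard numbers it prints should not be expected to match what Prometheus would assign.

```python
import hashlib
from collections import Counter


def shard_for(target: str, modulus: int) -> int:
    """Map a target address to a shard index in [0, modulus).

    Assumption for illustration: MD5 of the address, low 64 bits, modulo
    the shard count. Any stable hash would demonstrate the same idea.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(targets, shard, modulus):
    """Return the targets that the instance with this shard index scrapes."""
    return [t for t in targets if shard_for(t, modulus) == shard]


# Hypothetical pod addresses: 1000 pods of one app in a namespace.
targets = [f"10.0.{i // 250}.{i % 250 + 1}:8080" for i in range(1000)]
modulus = 4  # hypothetical number of Prometheus instances

# Each target lands on exactly one shard, so the per-shard counts add up
# to the total number of pods.
counts = Counter(shard_for(t, modulus) for t in targets)
for shard in range(modulus):
    print(f"shard {shard}: {counts[shard]} targets")
```

Because the assignment depends only on the target address, each instance can decide on its own which pods to scrape without coordinating with the others. Note that changing the shard count moves most targets to a different instance.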

