Hi Aliaksandr,

Thank you! Those numbers look interesting; we will give it a shot as well.

Thanks,
Karthik

On Sat, Oct 17, 2020 at 1:42 PM Aliaksandr Valialkin <[email protected]> wrote:

Hi Karthik,

There is another option: substitute Prometheus with the VictoriaMetrics stack <https://victoriametrics.github.io/>, which includes vmagent <https://victoriametrics.github.io/vmagent.html> for data scraping and vmalert <https://victoriametrics.github.io/vmalert.html> for alerting and recording rules. It is optimized for high load, so it should require fewer resources than Prometheus. See, for example, this case study <https://victoriametrics.github.io/CaseStudies.html#wixcom>.

Best Regards,
Aliaksandr Valialkin, CTO VictoriaMetrics
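To make that suggestion concrete, here is a minimal sketch of running vmagent in Kubernetes, reusing an existing Prometheus-style scrape config and forwarding samples to a single-node VictoriaMetrics via remote write. The image tag, the victoria-metrics service name, and the vmagent-config ConfigMap are illustrative placeholders, not part of the original suggestion:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmagent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent:v1.45.0   # placeholder tag
          args:
            # Reuse an existing Prometheus-style scrape config,
            # mounted from the hypothetical ConfigMap below.
            - -promscrape.config=/etc/vmagent/prometheus.yml
            # Single-node VictoriaMetrics accepts remote write on :8428.
            - -remoteWrite.url=http://victoria-metrics:8428/api/v1/write
          volumeMounts:
            - name: config
              mountPath: /etc/vmagent
      volumes:
        - name: config
          configMap:
            name: vmagent-config   # hypothetical ConfigMap holding prometheus.yml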
On Tue, Oct 13, 2020 at 2:23 PM Karthik Vijayaraju <[email protected]> wrote:

Thank you! I will try this out with a newer version and experiment with hashmod.

On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <[email protected]> wrote:

Thanks, knowing what Prometheus version you're on helps a lot. There are two things that will help setups like yours quite a lot.

First, Prometheus 2.19 introduced memory management improvements that mostly eliminate the memory growth caused by pod churn. It also greatly improves memory use at high scrape frequencies.

Second, 2.18.2 was the first official Prometheus version to be built with Go 1.14, which introduced an issue that affected compression, and hence the memory use of Prometheus. See https://github.com/prometheus/prometheus/pull/7976.

Once 2.22.0 is out, upgrading would be highly recommended.

You might also want to look at this Prometheus Operator issue about hashmod sharding: https://github.com/prometheus-operator/prometheus-operator/issues/2590

On Sun, Oct 11, 2020 at 10:14 PM kvr <[email protected]> wrote:

There are different services, and each could scale to 1000+ pods in a given namespace. But even then, managing a Prometheus instance pair per set of apps is not tenable: the management overhead would be too great when there are several such apps.

Version-wise we are keeping up, but not aggressively. We are on 2.18.2, and the instance under test does not have Thanos. It only scrapes and does some rule evaluation (memory usage is the same even when rule evaluation is disabled). We are using the Prometheus Operator to reload config.

Yeah, I read that ~2GB of memory is sufficient per million series — which would put 15M series at roughly 30GB — so I am surprised that it consumes such a large amount. Would having diverse scrape intervals have such an effect?

Our stats at peak:
~15M head series
~45M head chunks
~475K samples/s ingested
~7000 pods scraped

Thanks!

On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected] wrote:

If all of the 1000s of pods in a namespace are replicas of the same thing, you can use the hashmod feature to horizontally scale (see the sketch below).

You can have several Prometheus instances per namespace, each responsible for a fraction of the pods.

Just to be sure, are you keeping up to date on the latest releases? 200G of memory seems like a lot for 15M series.

Are you using Thanos or a remote write service?
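A minimal sketch of that relabel config, assuming Kubernetes pod service discovery and 4 shards (both the job name and the modulus are illustrative):

scrape_configs:
  - job_name: kubernetes-pods        # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 4 buckets...
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the bucket owned by this shard.
      # Each of the 4 Prometheus instances uses a different number here.
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

Each shard then carries only its fraction of the targets and head series, at the cost of needing something like federation or Thanos to query across shards.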
On Sun, Oct 11, 2020, 07:14 kvr <[email protected]> wrote:

Hello,

We are hitting some limits with our current setup of Prometheus. I have read a lot of posts here as well as blogs and videos, but still need some guidance.

Our current setup is at its limit. The head series count is regularly around 15M during pod churn. Each app exports between 5,000 and 8,000 metric series, so 1,000 pods create about 8M new series in the head block. Prometheus currently has access to 300GB of memory, but in practice it can't use more than 200GB, and it starts degrading around the 150GB mark:
- Scrape time for Prometheus scraping itself is 5+ seconds, and config reloads fail.
- We verified that this is not due to a cardinality explosion from a misbehaving app, so the degradation comes purely from load.
- We eliminated bad queries as a cause by spinning up an additional Prometheus which just scrapes targets and does nothing else, so the bottleneck is ingestion alone.

So the next step for us is to shard and use namespace-level Prometheis. But I expect a similar level of usage again in about a year at the namespace level, with multiple apps in a single namespace scaling to 1000s of pods exporting 5K metrics each. And I will not be able to shard again, because I don't want to go below namespace granularity.

How have others dealt with this situation, where the bottleneck is going to be ingestion rather than queries?

Thanks for your time,
KVR
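For reference when sizing or diagnosing shards: the figures discussed in this thread (head series, head chunks, ingestion rate, self-scrape time) can all be read from Prometheus's own metrics. A few standard PromQL queries; the job="prometheus" selector assumes the conventional self-scrape job name:

# Number of active series in the head block
prometheus_tsdb_head_series

# Number of chunks in the head block
prometheus_tsdb_head_chunks

# Ingestion rate (samples per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])

# How long Prometheus takes to scrape itself
scrape_duration_seconds{job="prometheus"}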

