Re: [prometheus-users] Scaling Prometheus

Karthik Vijayaraju Tue, 13 Oct 2020 04:24:21 -0700

Thank you!
I will try this out with a newer version and experiment with hashmod.
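
A rough sketch of what hashmod sharding does, for anyone following along. It assumes Prometheus hashes the source label value with MD5 and takes the low 64 bits of the digest modulo the shard count; treat that detail as an assumption, not a guarantee of matching Prometheus bit for bit. The function names and pod addresses are made up for illustration, and in a real setup this lives in relabel_configs (a hashmod action followed by a keep action), not in application code.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Map a label value to a shard number in [0, modulus).

    Assumption: MD5 digest, low 8 bytes read as a big-endian integer.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, shard, total_shards):
    """Return the targets that one Prometheus instance (shard) would keep."""
    return [a for a in addresses if hashmod(a, total_shards) == shard]


if __name__ == "__main__":
    # Hypothetical pod addresses, standing in for __address__ label values.
    pods = [f"10.0.{i // 250}.{i % 250}:8080" for i in range(1000)]
    total_shards = 4
    for shard in range(total_shards):
        kept = targets_for_shard(pods, shard, total_shards)
        print(f"shard {shard}: {len(kept)} targets")
```

Each target lands on exactly one shard, so every instance scrapes roughly 1/N of the pods without the instances having to coordinate.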


On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <[email protected]> wrote:

> Thanks, knowing what Prometheus version you're on helps a lot. There are
> two things that will help setups like yours quite a lot.
>
> First, Prometheus 2.19 introduced memory management improvements that
> mostly eliminate memory growth from pod churn. They also greatly improve
> memory use at high scrape frequencies.
>
> Second, 2.18.2 was the first official Prometheus version to be built with
> Go 1.14. This introduced an issue that affected compression, and hence the
> memory use, of Prometheus. See
> https://github.com/prometheus/prometheus/pull/7976.
>
> Once 2.22.0 is out, upgrading would be highly recommended.
>
> You might want to look at this Prometheus Operator issue about hashmod
> sharding:
> https://github.com/prometheus-operator/prometheus-operator/issues/2590
>
> On Sun, Oct 11, 2020 at 10:14 PM kvr <[email protected]> wrote:
>
>>
>> There are different services and each could scale to 1000+ pods in a
>> given namespace.
>> But even then managing a Prometheus instance pair per set of apps is not
>> tenable. The management overhead would be too great when there are several
>> such apps.
>>
>> Version-wise, we are keeping up, but not aggressively.
>> We are on 2.18.2 and the instance under test does not have Thanos. It
>> only scrapes and does some rule evaluation (the memory usage is the same
>> even when rule evaluation is disabled).
>> We are using the Prometheus Operator to reload config.
>>
>> Yeah, I read that ~2GB of memory is sufficient per million metrics, so I
>> am surprised that it consumes such a large amount. Would having diverse
>> scrape intervals have such an effect?
>>
>> Our stats at peak:
>> ~15M head series
>> ~45M head chunks
>> ~475K samples/s ingested
>> ~7000 pods scraped
>>
>> Thanks!
>>
>> On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected]
>> wrote:
>>
>>> If all of the 1000s of pods in a namespace are of the same thing, you
>>> can use the hashmod feature to horizontally scale.
>>>
>>> You can have several Prometheus instances per namespace, each
>>> responsible for a fraction of the pods.
>>>
>>> Just to be sure, are you keeping up to date on the latest releases? 200G
>>> of memory seems like a lot for 15M series.
>>>
>>> Are you using Thanos or a remote write service?
>>>
>>> On Sun, Oct 11, 2020, 07:14 kvr <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are hitting some limits with our current setup of Prometheus. I have
>>>> read a lot of posts here as well as blogs and videos but still need some
>>>> guidance.
>>>>
>>>> Our current setup is at its limit. Head series count is regularly around
>>>> 15M during pod churn. Each app exports between 5000 and 8000 metric
>>>> series, so 1000 pods cause up to about 8M new series in the head block.
>>>> Prometheus currently has access to 300 GB of memory, but in practice it
>>>> can't use more than 200 GB. It starts degrading around the 150 GB mark.
>>>> - Scrape time for Prometheus scraping itself is 5+ seconds and config
>>>> reloads fail.
>>>> - We verified that this is not due to a cardinality explosion from a
>>>> misbehaving app. So this has naturally degraded due to load.
>>>> - We eliminated bad queries as a cause by spinning up an additional
>>>> Prometheus which just scrapes targets and nothing else. So the bottleneck
>>>> is just ingestion.
>>>>
>>>> So the next step for us is to shard and use namespace-level Prometheis.
>>>> But I expect a similar level of usage in about a year again at the
>>>> namespace level, with multiple apps in a single namespace scaling to 1000s
>>>> of pods exporting 5K metrics each. And I will not be able to shard again
>>>> because I don't want to go below the NS granularity.
>>>>
>>>> How have others dealt with this situation, where the bottleneck is
>>>> going to be ingestion and not queries?
>>>>
>>>> Thanks for your time,
>>>> KVR
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Prometheus Users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com
>>>>
>>
>
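
The figures quoted in the thread can be sanity-checked with a little arithmetic. The sketch below uses only numbers from the messages above (15M head series, ~475K samples/s, ~2GB per million series, up to 8000 series per pod). The per-shard series budget is a purely hypothetical value chosen for illustration, and the average-interval estimate assumes every head series is still being scraped, which is not true during pod churn.

```python
import math

HEAD_SERIES = 15_000_000        # peak head series reported in the thread
SAMPLES_PER_SECOND = 475_000    # peak ingestion rate reported in the thread
GB_PER_MILLION_SERIES = 2.0     # rule of thumb quoted in the thread
SERIES_PER_POD_MAX = 8_000      # upper end of the per-app series count
PODS = 1_000


def expected_memory_gb(series, gb_per_million=GB_PER_MILLION_SERIES):
    """Memory estimate from the rule of thumb."""
    return series / 1_000_000 * gb_per_million


def shards_needed(series, max_series_per_shard):
    """Number of hashmod shards so that no shard exceeds the given budget."""
    return math.ceil(series / max_series_per_shard)


if __name__ == "__main__":
    print(expected_memory_gb(HEAD_SERIES))                # 30.0 GB, versus 150-200 GB observed
    print(SERIES_PER_POD_MAX * PODS)                      # 8000000 new series per 1000 pods
    print(round(HEAD_SERIES / SAMPLES_PER_SECOND, 1))     # 31.6 s rough average scrape interval
    print(shards_needed(HEAD_SERIES, 2_000_000))          # 8, with a hypothetical 2M-series budget
```

The gap between the 30 GB estimate and the observed usage is what prompted the version questions in the replies.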

