Re: [prometheus-users] Scaling Prometheus

Aliaksandr Valialkin Sat, 17 Oct 2020 01:13:01 -0700

Hi Karthik,

Another option is to replace Prometheus with the VictoriaMetrics
stack <https://victoriametrics.github.io/>, which includes vmagent
<https://victoriametrics.github.io/vmagent.html> for data scraping and
vmalert <https://victoriametrics.github.io/vmalert.html> for alerting and
recording rules. It is optimized for high load, so it should require fewer
resources than Prometheus. See, for example, this case
study <https://victoriametrics.github.io/CaseStudies.html#wixcom>.


On Tue, Oct 13, 2020 at 2:23 PM Karthik Vijayaraju <
[email protected]> wrote:

> Thank you!
> I will try this out with a newer version and experiment with hashmod.
>
> On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <[email protected]> wrote:
>
>> Thanks, knowing what Prometheus version you're on helps a lot. There are
>> two things that will help setups like yours quite a lot.
>>
>> First, Prometheus 2.19 introduced some new memory management improvements
>> that mostly eliminate memory growth caused by pod churn. They also greatly
>> improve memory use for high scrape frequencies.
>>
>> Second, 2.18.2 was the first official Prometheus version to be built with
>> Go 1.14. This introduced an issue that affected compression, and hence the
>> memory use, of Prometheus. See
>> https://github.com/prometheus/prometheus/pull/7976.
>>
>> Once 2.22.0 is out, upgrading would be highly recommended.
>>
>> You might want to look at this Prometheus Operator issue about hashmod
>> sharding:
>> https://github.com/prometheus-operator/prometheus-operator/issues/2590
>>
>> On Sun, Oct 11, 2020 at 10:14 PM kvr <[email protected]>
>> wrote:
>>
>>>
>>> There are different services and each could scale to 1000+ pods in a
>>> given namespace.
>>> But even then managing a Prometheus instance pair per set of apps is not
>>> tenable. The management overhead would be too great when there are several
>>> such apps.
>>>
>>> Version-wise, we are keeping up, but not aggressively.
>>> We are on 2.18.2 and the instance under test does not have Thanos. It
>>> only scrapes and does some rule evaluation (the memory usage is the same
>>> even when rule evaluation is disabled).
>>> We are using the Prometheus Operator to reload the config.
>>>
>>> Yeah, I read that ~2GB of memory is sufficient per million metrics, so I
>>> am surprised that it consumes such a large amount. Would having diverse
>>> scrape intervals have such an effect?
>>>
>>> Our stats at peak:
>>> ~15M head series
>>> ~45M head chunks
>>> ~475K samples/s ingested
>>> ~7000 pods scraped
>>>
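
For reference, the figures quoted in this thread can be put side by side
with a little arithmetic. This is only a rough sketch: it assumes decimal
gigabytes, takes the ~200 GB ceiling mentioned elsewhere in the thread as the
observed usage, and treats every head series as actively scraped, which is
not true during pod churn. The variable names are illustrative.

```python
# Figures quoted in the thread.
head_series = 15_000_000        # ~15M head series at peak
head_chunks = 45_000_000        # ~45M head chunks at peak
samples_per_s = 475_000         # ~475K samples/s ingested
observed_gb = 200               # memory ceiling reported in the thread
rule_of_thumb_gb_per_million = 2  # "~2GB per million" rule of thumb

# Memory the rule of thumb would predict for this series count.
expected_gb = head_series / 1_000_000 * rule_of_thumb_gb_per_million

# Observed memory per series, in decimal kilobytes (assumption: 1 GB = 1e9 B).
observed_kb_per_series = observed_gb * 1e9 / head_series / 1e3

# Average gap between samples per series, IF all head series were live.
mean_interval_s = head_series / samples_per_s

# Average number of head chunks held per series.
chunks_per_series = head_chunks / head_series

print(f"expected by rule of thumb: {expected_gb:.0f} GB")
print(f"observed per series:       {observed_kb_per_series:.1f} kB")
print(f"implied mean interval:     {mean_interval_s:.1f} s")
print(f"head chunks per series:    {chunks_per_series:.1f}")
```

On these numbers the rule of thumb predicts roughly 30 GB, while the
observed usage works out to several times the ~2 kB per series it implies.
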
>>> Thanks!
>>>
>>> On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected]
>>> wrote:
>>>
>>>> If all of the 1000s of pods in a namespace are of the same thing, you
>>>> can use the hashmod feature to horizontally scale.
>>>>
>>>> You can have several Prometheus instances per namespace, each
>>>> responsible for a fraction of the pods.
>>>>
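
To illustrate the idea of each instance taking a fraction of the pods, here
is a minimal sketch of hash-then-modulus target selection. It assumes an
md5-based scheme similar to what the hashmod relabel action does; it is not
guaranteed to produce the same shard numbers as Prometheus itself. The
function names and pod addresses are hypothetical.

```python
import hashlib


def shard_for(address: str, modulus: int) -> int:
    """Map a target address to a shard index via md5, then modulus."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned 64-bit integer.
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, shard, modulus):
    """Return the subset of addresses that the given shard would keep."""
    return [a for a in addresses if shard_for(a, modulus) == shard]


# Hypothetical pod addresses, for illustration only.
pods = [f"10.0.{i // 250}.{i % 250 + 1}:8080" for i in range(1000)]

num_shards = 4
shards = [targets_for_shard(pods, s, num_shards) for s in range(num_shards)]

for index, members in enumerate(shards):
    print(f"shard {index}: {len(members)} targets")
```

Every target lands on exactly one shard, and the assignment depends only on
the address, so each instance can decide independently which pods to keep.
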
>>>> Just to be sure, are you keeping up to date on the latest releases?
>>>> 200G of memory seems like a lot for 15M series.
>>>>
>>>> Are you using Thanos or a remote write service?
>>>>
>>>> On Sun, Oct 11, 2020, 07:14 kvr <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are hitting some limits with our current setup of Prometheus. I
>>>>> have read a lot of posts here as well as blogs and videos but still need
>>>>> some guidance.
>>>>>
>>>>> Our current setup is at its limit. Head series count regularly reaches
>>>>> around 15M during pod churn. Each app exports between 5000 and 8000 metric
>>>>> series, so 1000 pods cause up to about 8M new series in the head block.
>>>>> Prometheus currently has access to 300 GB of memory, but in practice it
>>>>> can't use more than 200 GB. It starts degrading around the 150 GB mark.
>>>>> - Scrape time for Prometheus scraping itself is 5+ seconds and config
>>>>> reloads fail.
>>>>> - We verified that this is not due to a cardinality explosion from a
>>>>> misbehaving app. So this has naturally degraded due to load.
>>>>> - We eliminated bad queries as a cause by spinning up an additional
>>>>> Prometheus which just scrapes targets and nothing else. So the bottleneck
>>>>> is just ingestion.
>>>>>
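
The churn arithmetic above can be written out as a small sketch. The
per-pod series counts come from the message; the per-instance series budget
is a purely hypothetical number chosen for illustration, not a
recommendation, and the function names are made up.

```python
import math


def new_series(pods: int, series_per_pod: int) -> int:
    """Series added to the head block when this many pods are (re)created."""
    return pods * series_per_pod


def shards_needed(total_series: int, budget_per_instance: int) -> int:
    """Smallest number of instances that keeps each one under the budget."""
    return math.ceil(total_series / budget_per_instance)


low = new_series(1000, 5000)    # lower bound from the message
high = new_series(1000, 8000)   # upper bound from the message

# Hypothetical budget: 2M head series per Prometheus instance.
hypothetical_budget = 2_000_000

print(f"new series per 1000 pods: {low:,} to {high:,}")
print(f"shards at the upper bound: {shards_needed(high, hypothetical_budget)}")
```
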
>>>>> So the next step for us is to shard and use namespace-level
>>>>> Prometheis. But I expect a similar level of usage again in about a year
>>>>> at the namespace level, with multiple apps in a single namespace scaling to
>>>>> 1000s of pods exporting 5K metrics each. And I will not be able to shard
>>>>> again because I don't want to go below the NS granularity.
>>>>>
>>>>> How have others dealt with this situation, where the bottleneck is
>>>>> going to be ingestion and not queries?
>>>>>
>>>>> Thanks for your time,
>>>>> KVR
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/prometheus-users/cf15cc42-fe3e-4f4d-8489-3750fac7f81en%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>


-- 
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

