Hi Aliaksandr,

Thank you! Those numbers look interesting; we will give it a shot as well.

Thanks,
Karthik

On Sat, Oct 17, 2020 at 1:42 PM Aliaksandr Valialkin <[email protected]> wrote:

Hi Karthik,

There is another option: substitute Prometheus with the VictoriaMetrics stack <https://victoriametrics.github.io/>, which includes vmagent <https://victoriametrics.github.io/vmagent.html> for data scraping and vmalert <https://victoriametrics.github.io/vmalert.html> for alerting and recording rules. It is optimized for high load, so it should require fewer resources than Prometheus. See, for example, this case study <https://victoriametrics.github.io/CaseStudies.html#wixcom>.

Best Regards,
Aliaksandr Valialkin, CTO VictoriaMetrics
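To make that suggestion concrete, here is a minimal sketch of running vmagent in Kubernetes, reusing an existing Prometheus-style scrape config and forwarding samples to a single-node VictoriaMetrics via remote write. The image tag, the victoria-metrics service name, and the vmagent-config ConfigMap are illustrative placeholders, not part of the original suggestion:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmagent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent:v1.45.0   # placeholder tag
          args:
            # Reuse an existing Prometheus-style scrape config,
            # mounted from the hypothetical ConfigMap below.
            - -promscrape.config=/etc/vmagent/prometheus.yml
            # Single-node VictoriaMetrics accepts remote write on :8428.
            - -remoteWrite.url=http://victoria-metrics:8428/api/v1/write
          volumeMounts:
            - name: config
              mountPath: /etc/vmagent
      volumes:
        - name: config
          configMap:
            name: vmagent-config   # hypothetical ConfigMap holding prometheus.yml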
On Tue, Oct 13, 2020 at 2:23 PM Karthik Vijayaraju <[email protected]> wrote:

Thank you! I will try this out with a newer version and experiment with hashmod.

On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <[email protected]> wrote:

Thanks, knowing what Prometheus version you're on helps a lot. There are two things that will help setups like yours quite a lot.

First, Prometheus 2.19 introduced memory management improvements that mostly eliminate the memory growth caused by pod churn. It also greatly improves memory use at high scrape frequencies.

Second, 2.18.2 was the first official Prometheus version to be built with Go 1.14, which introduced an issue that affected compression, and hence the memory use of Prometheus. See https://github.com/prometheus/prometheus/pull/7976.

Once 2.22.0 is out, upgrading would be highly recommended.

You might also want to look at this Prometheus Operator issue about hashmod sharding: https://github.com/prometheus-operator/prometheus-operator/issues/2590

On Sun, Oct 11, 2020 at 10:14 PM kvr <[email protected]> wrote:

There are different services, and each could scale to 1000+ pods in a given namespace. But even then, managing a Prometheus instance pair per set of apps is not tenable: the management overhead would be too great when there are several such apps.

Version-wise we are keeping up, but not aggressively. We are on 2.18.2, and the instance under test does not have Thanos. It only scrapes and does some rule evaluation (memory usage is the same even when rule evaluation is disabled). We are using the Prometheus Operator to reload config.

Yeah, I read that ~2GB of memory is sufficient per million series — which would put 15M series at roughly 30GB — so I am surprised that it consumes such a large amount. Would having diverse scrape intervals have such an effect?

Our stats at peak:
~15M head series
~45M head chunks
~475K samples/s ingested
~7000 pods scraped

Thanks!

On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 [email protected] wrote:

If all of the 1000s of pods in a namespace are replicas of the same thing, you can use the hashmod feature to horizontally scale (see the sketch below).

You can have several Prometheus instances per namespace, each responsible for a fraction of the pods.

Just to be sure, are you keeping up to date on the latest releases? 200G of memory seems like a lot for 15M series.

Are you using Thanos or a remote write service?
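A minimal sketch of that relabel config, assuming Kubernetes pod service discovery and 4 shards (both the job name and the modulus are illustrative):

scrape_configs:
  - job_name: kubernetes-pods        # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 4 buckets...
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the bucket owned by this shard.
      # Each of the 4 Prometheus instances uses a different number here.
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

Each shard then carries only its fraction of the targets and head series, at the cost of needing something like federation or Thanos to query across shards.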
On Sun, Oct 11, 2020, 07:14 kvr <[email protected]> wrote:

Hello,

We are hitting some limits with our current setup of Prometheus. I have read a lot of posts here as well as blogs and videos, but still need some guidance.

Our current setup is at its limit. The head series count is regularly around 15M during pod churn. Each app exports between 5,000 and 8,000 metric series, so 1,000 pods create about 8M new series in the head block. Prometheus currently has access to 300GB of memory, but in practice it can't use more than 200GB, and it starts degrading around the 150GB mark:
- Scrape time for Prometheus scraping itself is 5+ seconds, and config reloads fail.
- We verified that this is not due to a cardinality explosion from a misbehaving app, so the degradation comes purely from load.
- We eliminated bad queries as a cause by spinning up an additional Prometheus which just scrapes targets and does nothing else, so the bottleneck is ingestion alone.

So the next step for us is to shard and use namespace-level Prometheis. But I expect a similar level of usage again in about a year at the namespace level, with multiple apps in a single namespace scaling to 1000s of pods exporting 5K metrics each. And I will not be able to shard again, because I don't want to go below namespace granularity.

How have others dealt with this situation, where the bottleneck is going to be ingestion rather than queries?

Thanks for your time,
KVR
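For reference when sizing or diagnosing shards: the figures discussed in this thread (head series, head chunks, ingestion rate, self-scrape time) can all be read from Prometheus's own metrics. A few standard PromQL queries; the job="prometheus" selector assumes the conventional self-scrape job name:

# Number of active series in the head block
prometheus_tsdb_head_series

# Number of chunks in the head block
prometheus_tsdb_head_chunks

# Ingestion rate (samples per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])

# How long Prometheus takes to scrape itself
scrape_duration_seconds{job="prometheus"}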

