Thanks for the help, Aliaksandr and Julien. I upgraded to the latest Prometheus (2.25.0), built with Go 1.15.8, and I'm seeing a huge performance improvement.
On Tuesday, March 9, 2021 at 11:22:39 AM UTC-8 [email protected] wrote:

> On Fri, Mar 5, 2021 at 12:10 AM Dhruv Patel <[email protected]> wrote:
>
>> Hi Folks,
>> We are seeing an issue in our current Prometheus setup where we are not
>> able to ingest beyond 22 million metrics/min. We have run several load
>> tests at 25 million, 29 million and 35 million, but the ingestion rate
>> stays constant at around the same 22 million metrics/min. Moreover, CPU
>> usage is around 70% and more than 50% of memory is still available, so it
>> looks like we are not hitting resource limits but rather lock contention.
>>
>> Prometheus version: 2.9.1
>> Host shape: x7-enclave-104 (a bare-metal host with 104 processor units).
>> More info can be obtained in the screenshots below.
>> Memory info:
>>               total    used    free   shared  buff/cache   available
>> Mem:           754G     88G    528G      67M        136G        719G
>> Swap:          1.0G      0B    1.0G
>> Total:         755G     88G    529G
>>
>> We also ran profiling during our load tests at 20 million, 22 million and
>> 25 million, and saw an increase in the time taken by runtime.mallocgc,
>> which leads to increased usage of runtime.futex. Somehow we are not able
>> to figure out what could be causing the lock contention. I have attached
>> our profiling results at the different load-test levels in case that is
>> useful. Any ideas on what could be causing the high time taken in
>> runtime.mallocgc?
>
> Prometheus is written in Go. The runtime.mallocgc function is called every
> time Prometheus allocates a new object during its operation. It looks like
> Prometheus 2.9.1 allocates a lot during the load test. runtime.futex is
> used internally by the Go runtime during object allocation and subsequent
> deallocation (aka garbage collection).
> It looks like the Go runtime used in Prometheus 2.9.1 isn't well optimized
> for programs with frequent object allocations that run on systems with many
> CPU cores. This should be improved in Go 1.15: "Allocation of small objects
> now performs much better at high core counts, and has lower worst-case
> latency" <https://tip.golang.org/doc/go1.15#runtime>. So it is recommended
> to repeat the load test on the latest available version of Prometheus,
> which is hopefully built with at least Go 1.15 - see
> https://github.com/prometheus/prometheus/releases .
>
> Additionally, you can run the load test on VictoriaMetrics and compare its
> scalability with Prometheus. See
> https://victoriametrics.github.io/#how-to-scrape-prometheus-exporters-such-as-node-exporter .
>
> --
> Best Regards,
>
> Aliaksandr Valialkin, CTO VictoriaMetrics

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f54ae6b0-26ac-4ea2-a62c-48aa81aba0e1n%40googlegroups.com.
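For anyone else hitting the same symptom: the runtime.mallocgc / runtime.futex pattern in the profiles above can be reproduced outside Prometheus. The following is a minimal standalone sketch (nothing Prometheus-specific; the 64-byte object size and iteration counts are arbitrary choices of mine) that allocates small objects from many goroutines at once and counts the resulting heap allocations via runtime.MemStats. Profiling a workload like this on a many-core box with an older Go shows exactly the allocation and futex contention discussed in the thread.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// allocWorkload allocates many small objects from nGoroutines goroutines
// concurrently, loosely mimicking the per-sample allocations an ingestion
// path performs, and returns how many heap allocations happened meanwhile.
func allocWorkload(nGoroutines, allocsPerGoroutine int) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	// sink keeps the last allocation of each goroutine reachable, so the
	// compiler cannot optimize the allocations away (they escape to the heap).
	sink := make([][]byte, nGoroutines)

	var wg sync.WaitGroup
	for g := 0; g < nGoroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			var last []byte
			for i := 0; i < allocsPerGoroutine; i++ {
				// Small objects are served from the per-P mcache, falling
				// back to the shared mcentral/mheap under contention - the
				// path that improved at high core counts in Go 1.15.
				last = make([]byte, 64)
			}
			sink[g] = last
		}(g)
	}
	wg.Wait()

	runtime.ReadMemStats(&after)
	// MemStats.Mallocs is a cumulative count of heap objects allocated, so
	// the delta includes some unrelated runtime allocations as noise.
	return after.Mallocs - before.Mallocs
}

func main() {
	n := allocWorkload(runtime.NumCPU(), 100000)
	fmt.Printf("heap allocations performed: %d\n", n)
}
```

Running this under `go test -bench` style profiling (or with net/http/pprof attached) is an easy way to compare allocator behavior between Go versions on the same 104-core host.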
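Since the fix here was essentially being on a Go 1.15+ build, it helps to check which Go toolchain a given Prometheus binary was built with before load testing: `go version /path/to/prometheus` prints it for a binary on disk, and (if I remember right) the prometheus_build_info metric exposes it in a goversion label. Inside a Go process, runtime.Version() reports the same string. Below is a hedged sketch of such a check; builtWithAtLeastGo115 is a made-up helper and its string parsing is deliberately crude (it only understands release strings like "go1.15.8", not devel builds).

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// builtWithAtLeastGo115 reports whether a runtime version string such as
// "go1.15.8" refers to Go 1.15 or newer - the release where small-object
// allocation at high core counts was improved. Unrecognized strings
// (e.g. devel builds) conservatively return false.
func builtWithAtLeastGo115(version string) bool {
	v := strings.TrimPrefix(version, "go1.") // "go1.15.8" -> "15.8"
	parts := strings.SplitN(v, ".", 2)
	minor := 0
	fmt.Sscanf(parts[0], "%d", &minor)
	return minor >= 15
}

func main() {
	v := runtime.Version()
	fmt.Println(v, "is Go 1.15+:", builtWithAtLeastGo115(v))
}
```

The official Prometheus release binaries list the Go version used in each release's notes, so checking https://github.com/prometheus/prometheus/releases as suggested above is usually enough without running anything.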

