Look at Status > TSDB Status from the web interface of both systems. In particular, what does the first entry ("Head Stats") show for each system?
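If it's easier to compare numbers than screenshots, roughly the same information (plus a couple of the things I ask about below) can be pulled with PromQL on each server. This is just a rough sketch, assuming each Prometheus scrapes itself under the usual job="prometheus" default; adjust the job label if yours differs:

  prometheus_tsdb_head_series                                   # live series currently in the head block
  rate(prometheus_tsdb_head_series_created_total[1h]) * 3600    # new series created per hour (churn)
  rate(prometheus_tsdb_head_series_removed_total[1h]) * 3600    # series removed per hour
  process_resident_memory_bytes{job="prometheus"}               # RSS as the OS sees it
  go_memstats_heap_inuse_bytes{job="prometheus"}                # Go heap actually in use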
Do you have any idea of series churn, i.e. how many new series are being created and deleted per hour? (Although if you're scraping a subset of the same targets on non-prod, it shouldn't be any worse.)

Prometheus exposes stats about its internal memory usage (go_memstats_*); can you see any difference between the two systems there?

Are you hitting the non-production system with queries? If so, can you try not querying it for a while?

Otherwise, you can try replicating the production system *exactly* in the non-production one: same binaries, same configuration, same retention. If it still behaves differently, then it's something about the environment.

I observe that NAS is *not* recommended as a storage backend for Prometheus. See https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects (scroll to the yellow "CAUTION" box).

On Friday, 15 October 2021 at 15:02:33 UTC+1 tass...@gmail.com wrote:
> We have been running Prometheus for 3 or 4 years now. In production we
> have 6-month retention and in non-production we have a retention of 45
> days. In production we are capturing 1.8 million metrics with 2,300
> targets. In non-prod we are capturing 800K metrics with 2,200 targets.
> The configuration is the same between the environments. Both production
> and non-prod servers have 4 CPUs and 24 GB of memory. Production is using
> 160% CPU and 5.5 GB of memory. Non-prod is running out of memory even
> after increasing the server memory to 64 GB. This seemed to happen after
> patching non-prod to 2.30.3. Production is on 2.30.0. We are using NAS
> storage. Non-prod has 500 GB and production has 4 TB of storage.
>
> I have been doing several tests in non-production to isolate the issue,
> to see if it is an issue with the number of targets or the storage. I
> have tried reducing the targets and the retention time. The results seem
> to be the same between 2.30.3 and 2.30.0.
>
> prometheus-2.30.3
> 53 MB   no targets, clean storage
> 41 GB   no targets, with storage history
> 5.5 GB  targets, clean storage
> 42 GB   targets, with storage history
>
> prometheus-2.30.0
> 2 MB    no targets, clean storage
> 50 GB   no targets, with storage history
> 4 GB    targets, clean storage
> 47 GB   targets, with storage history
>
> With less retention than production, non-prod with no targets is using
> 10x the memory of production, even on the same hardware. After adding
> targets, even with no history, the memory increases in non-prod until the
> OS kills Prometheus due to out of memory. I have increased the server
> from 24 GB to 32 GB to 64 GB and Prometheus memory never stabilizes. I
> have tried removing targets and it does seem to help.
>
> There appears to be some sort of memory leak, but it is never aliens
> until it is aliens. We are scraping most metrics every 15 seconds in
> production and have changed non-prod to every 30 seconds with the same
> results. We are using Consul for service discovery. Not sure what else to
> look at. Any suggestion on what to look at next?
>
> This is my first time posting, so I figured I would ask the community
> rather than submitting a bug on GitHub.
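P.S. If you run node_exporter on both boxes, it can also confirm what filesystem the data directory actually lives on, which would confirm or rule out the NAS angle. A rough sketch; the mountpoint below is only an example and needs to match the actual mount point containing whatever you pass to --storage.tsdb.path:

  node_filesystem_size_bytes{mountpoint="/var/lib/prometheus"}  # example mountpoint; the fstype label shows nfs/nfs4/cifs vs ext4/xfs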