Hi Ben,

We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves the
problem. I currently have one of the two replicas in production crashlooping,
so I'll try to roll this out in the next few hours and report back.
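For anyone following along, pinning the Prometheus image version in the prometheus-operator Helm chart can be done with a values override along these lines. This is only a sketch: the exact key paths depend on the chart version you have deployed, so treat the names below as assumptions and check them against your chart's values file.

```yaml
# values.yaml override for the prometheus-operator Helm chart.
# Key paths (prometheus.prometheusSpec.image.*) are assumptions based on
# common chart layouts -- verify against your chart version.
prometheus:
  prometheusSpec:
    image:
      repository: quay.io/prometheus/prometheus
      tag: v2.19.2
```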
Thanks,
Vik

On Wed, 1 Jul 2020 at 16:32, Ben Kochie <[email protected]> wrote:

> What version of Prometheus do you have deployed? We've made several major
> improvements to WAL handling and startup in the last couple of releases.
>
> I would recommend upgrading to 2.19.2 if you haven't.
>
> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai <[email protected]>
> wrote:
>
>> Hi all,
>>
>> We have a recurring problem with Prometheus repeatedly getting OOMKilled
>> on startup while trying to process the write-ahead log. I looked through
>> the GitHub issues but, as far as I could see, there was no solution or
>> currently open issue.
>>
>> We run on Kubernetes in GKE using the prometheus-operator Helm chart, on
>> Google Cloud preemptible VMs. These VMs are killed at most every 24
>> hours, so our Prometheus pods also get killed and are automatically
>> rescheduled by Kubernetes (the data is on a persistent volume, of
>> course). To avoid losing metrics, we run two identically configured
>> replicas with their own storage, scraping the same targets.
>>
>> We monitor numerous GCE VMs that do batch processing, running anywhere
>> from a few minutes to several hours. This workload is bursty,
>> fluctuating between tens and hundreds of VMs active at any time, so the
>> Prometheus wal folder sometimes grows to 10-15 GB. Prometheus usually
>> handles this workload with about half a CPU core and 8 GB of RAM, and if
>> left to its own devices the wal folder shrinks again when the load
>> decreases.
>>
>> The problem is that when there is a backlog and Prometheus is restarted
>> (because the preemptible VM went away), it uses several times more RAM
>> to replay the wal folder. This often exhausts all the available memory
>> on the Kubernetes worker, so Prometheus is killed by the OOM killer over
>> and over again, until I log in and delete the wal folder, losing several
>> hours of metrics.
>> I have already doubled the size of the VMs just to accommodate
>> Prometheus and I am reluctant to do it again. Running non-preemptible
>> VMs would triple the cost of these instances, and Prometheus might still
>> get restarted when we roll out an update -- so that would probably not
>> solve the issue properly either.
>>
>> I don't know if there is something special about our use case, but I did
>> come across a blog post describing the same high memory usage on
>> startup.
>>
>> I feel that unless there is a fix I can apply, this warrants either a
>> bug report or a feature request -- Prometheus should be able to recover
>> without operator intervention or loss of metrics. And for a process
>> running on Kubernetes, we should be able to set memory "request" and
>> "limit" values that are close to the actual expected usage, rather than
>> 3-4 times the steady-state usage just to accommodate the memory
>> requirements of the startup phase.
>>
>> Please let me know what information I should provide, if any. I have
>> some graph screenshots that would be relevant.
>>
>> Many thanks,
>> Vik
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>
> --
> My other sig is hilarious

--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CANx-tGgz631pR8LGDxWpw%2BGFQDAGezOWETWZAj4-n%3DSV-m3www%40mail.gmail.com.
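For reference, the kind of resource settings discussed in the thread would look roughly like this in the chart's values. This is a sketch only: the key paths are assumptions about the prometheus-operator chart layout, and the numbers simply restate the figures from the thread (about half a core and 8 GB steady state, with startup WAL replay needing 3-4 times that).

```yaml
# Sketch of resource requests/limits for the Prometheus pod.
# Key paths are assumptions -- check your chart version's values file.
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 500m        # ~half a core steady state, per the thread
        memory: 8Gi      # steady-state usage, per the thread
      limits:
        memory: 32Gi     # ~4x steady state, to survive WAL replay on startup
```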

