Hi Ben,

We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves the
problem. I currently have one of the two replicas in production crashlooping,
so I'll try to roll this out in the next few hours and report back.
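For anyone following along, pinning the Prometheus image version in the prometheus-operator Helm chart can be done with a values override along these lines. This is only a sketch: the exact key paths depend on the chart version you have deployed, so treat the names below as assumptions and check them against your chart's values file.

```yaml
# values.yaml override for the prometheus-operator Helm chart.
# Key paths (prometheus.prometheusSpec.image.*) are assumptions based on
# common chart layouts -- verify against your chart version.
prometheus:
  prometheusSpec:
    image:
      repository: quay.io/prometheus/prometheus
      tag: v2.19.2
```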
Thanks,
Vik

On Wed, 1 Jul 2020 at 16:32, Ben Kochie <[email protected]> wrote:

> What version of Prometheus do you have deployed? We've made several major
> improvements to WAL handling and startup in the last couple of releases.
>
> I would recommend upgrading to 2.19.2 if you haven't.
>
> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai <[email protected]>
> wrote:
>
>> Hi all,
>>
>> We have a recurring problem with Prometheus repeatedly getting OOMKilled
>> on startup while trying to process the write-ahead log. I looked through
>> the GitHub issues but, as far as I could see, there was no solution or
>> currently open issue.
>>
>> We run on Kubernetes in GKE using the prometheus-operator Helm chart, on
>> Google Cloud preemptible VMs. These VMs are killed at most every 24
>> hours, so our Prometheus pods also get killed and are automatically
>> rescheduled by Kubernetes (the data is on a persistent volume, of
>> course). To avoid losing metrics, we run two identically configured
>> replicas with their own storage, scraping the same targets.
>>
>> We monitor numerous GCE VMs that do batch processing, running anywhere
>> from a few minutes to several hours. This workload is bursty,
>> fluctuating between tens and hundreds of VMs active at any time, so the
>> Prometheus wal folder sometimes grows to 10-15 GB. Prometheus usually
>> handles this workload with about half a CPU core and 8 GB of RAM, and if
>> left to its own devices the wal folder shrinks again when the load
>> decreases.
>>
>> The problem is that when there is a backlog and Prometheus is restarted
>> (because the preemptible VM went away), it uses several times more RAM
>> to replay the wal folder. This often exhausts all the available memory
>> on the Kubernetes worker, so Prometheus is killed by the OOM killer over
>> and over again, until I log in and delete the wal folder, losing several
>> hours of metrics.
>> I have already doubled the size of the VMs just to accommodate
>> Prometheus and I am reluctant to do it again. Running non-preemptible
>> VMs would triple the cost of these instances, and Prometheus might still
>> get restarted when we roll out an update -- so that would probably not
>> solve the issue properly either.
>>
>> I don't know if there is something special about our use case, but I did
>> come across a blog post describing the same high memory usage on
>> startup.
>>
>> I feel that unless there is a fix I can apply, this warrants either a
>> bug report or a feature request -- Prometheus should be able to recover
>> without operator intervention or loss of metrics. And for a process
>> running on Kubernetes, we should be able to set memory "request" and
>> "limit" values that are close to the actual expected usage, rather than
>> 3-4 times the steady-state usage just to accommodate the memory
>> requirements of the startup phase.
>>
>> Please let me know what information I should provide, if any. I have
>> some graph screenshots that would be relevant.
>>
>> Many thanks,
>> Vik
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>
> --
> My other sig is hilarious

--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CANx-tGgz631pR8LGDxWpw%2BGFQDAGezOWETWZAj4-n%3DSV-m3www%40mail.gmail.com.
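For reference, the kind of resource settings discussed in the thread would look roughly like this in the chart's values. This is a sketch only: the key paths are assumptions about the prometheus-operator chart layout, and the numbers simply restate the figures from the thread (about half a core and 8 GB steady state, with startup WAL replay needing 3-4 times that).

```yaml
# Sketch of resource requests/limits for the Prometheus pod.
# Key paths are assumptions -- check your chart version's values file.
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 500m        # ~half a core steady state, per the thread
        memory: 8Gi      # steady-state usage, per the thread
      limits:
        memory: 32Gi     # ~4x steady state, to survive WAL replay on startup
```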

