Hi Matthias,

Thanks, I think this should definitely help, but I'm not sure it will always
solve the problem. If I understand correctly, the WAL holds 6 hours of
data, and in our experience the high-water mark for memory usage seems to be
about 3-4 times the WAL size. So while processing 2 hours' worth, you might
go higher than normal, but not several times higher.

What would be very nice is if Prometheus observed the rlimit set for
maximum virtual memory size and flushed the WAL when it got close to
that. When Prometheus starts up, it already prints the values (if set):
level=info ts=2020-07-01T15:31:50.711Z caller=main.go:341
vm_limits="(soft=unlimited, hard=unlimited)"

I tried setting these with a small Bash wrapper script and ulimit, but this
resulted in a Golang out-of-memory error and termination, instead of the
Linux OOM killer and termination :)
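For reference, the wrapper was roughly along these lines (a minimal sketch;
the 12 GiB figure and the commented-out Prometheus invocation are
placeholders, not our actual values):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: cap the process' virtual memory before starting
# Prometheus, so the limit applies to it and everything it exec's.
# NOTE: the 12 GiB figure below is illustrative only.
LIMIT_KB=$((12 * 1024 * 1024))  # ulimit -v takes the limit in KiB
ulimit -v "$LIMIT_KB"
echo "virtual memory soft limit set to ${LIMIT_KB} KiB"
# exec /bin/prometheus --config.file=/etc/prometheus/prometheus.yml "$@"
```

Since the Go runtime's allocations fail once the address-space limit is hit,
you get a Go runtime out-of-memory panic rather than a SIGKILL from the
kernel, which is what we observed.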

Many thanks,
Vik

On Wed, 1 Jul 2020 at 16:27, Matthias Rampke <[email protected]> wrote:

> I have been thinking about this problem as well, since we ran into a
> similar issue yesterday. In our case, Prometheus had already failed to
> write out a TSDB block for a few hours but kept on piling data into the
> head block.
>
> Could TSDB write out blocks *during* WAL recovery? Say, for every two
> hours' worth of WAL or even more frequently, it could pause recovery, write
> a block, delete the WAL up to that point, continue recovery. This would put
> something of a bound on the memory usage during recovery, and alleviate the
> issue that recovery from out-of-memory takes *even more memory*.
>
> Would this help in your case?
>
> /MR
>
>
> On Wed, Jul 1, 2020 at 3:06 PM Viktor Radnai <[email protected]>
> wrote:
>
>> Hi all,
>>
>> We have a recurring problem with Prometheus repeatedly getting OOMKilled
>> on startup while trying to process the write-ahead log. I looked through
>> GitHub issues, but as far as I could see there is no solution or
>> currently open issue for this.
>>
>> We are running on Kubernetes in GKE using the prometheus-operator Helm
>> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
>> hours maximum, so our Prometheus pods also get killed and automatically
>> migrated by Kubernetes (the data is on a persistent volume of course). To
>> avoid loss of metrics, we run two identically configured replicas with
>> their own storage, scraping all the same targets.
>>
>> We monitor numerous GCE VMs that do batch processing, running anywhere
>> between a few minutes to several hours. This workload is bursty,
>> fluctuating between tens and hundreds of VMs active at any time, so
>> sometimes the Prometheus wal folder grows to between 10 and 15GB in size.
>> Prometheus usually handles this workload with about half a CPU core and 8GB
>> of RAM and if left to its own devices, the wal folder will shrink again
>> when the load decreases.
>>
>> The problem is that when there is a backlog and Prometheus is restarted
>> (due to the preemptible VM going away), it uses several times more RAM
>> to recover the wal folder. This often exhausts all the available memory on
>> the Kubernetes worker, so Prometheus is killed by the OOM killer over and
>> over again, until I log in and delete the wal folder, losing several hours
>> of metrics. I have already doubled the size of the VMs just to accommodate
>> Prometheus and I am reluctant to do this again. Running non-preemptible VMs
>> would triple the cost of these instances, and Prometheus might still get
>> restarted when we roll out an update -- so this would probably not even
>> solve the issue properly.
>>
>> I don't know if there is anything special about our use case, but I did
>> come across a blog post describing the same high memory usage behaviour
>> on startup.
>>
>> I feel that unless there is a fix I can do, this would warrant either a
>> bug or feature request -- Prometheus should be able to recover without
>> operator intervention or losing metrics. And for a process running on
>> Kubernetes, we should be able to set memory "request" and "limit" values
>> that are close to actual expected usage, rather than 3-4 times the steady
>> state usage just to accommodate the memory requirements of the startup
>> phase.
>>
>> Please let me know what information I should provide, if any. I have some
>> graph screenshots that would be relevant.
>>
>> Many thanks,
>> Vik
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>> <https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>>
>

-- 
My other sig is hilarious

