Hi Julien,

Thanks for clarifying that. In that case I'll see whether the issue recurs
with 2.19.2 over the next few weeks.

Vik

On Wed, 1 Jul 2020 at 19:08, Julien Pivotto <[email protected]>
wrote:

> Once 2.19 has been running, it will create the mmapped head chunks, which
> will improve that.
>
> I agree that starting 2.19 with a 2.18 WAL won't make a difference.
>
> On Wed, 1 Jul 2020 at 19:55, Viktor Radnai <[email protected]>
> wrote:
>
>> Hi again Ben,
>>
>> Unfortunately, upgrading to 2.19.2 does not solve the startup issue.
>> Prometheus gets OOMKilled before even starting to parse the last 25
>> segments, which represent the last 50 minutes' worth of data. Based on this,
>> the estimated memory requirement should be somewhere between 60 and 70GB,
>> but the worker node only has 52GB. The other Prometheus pod currently
>> consumes 7.7GB.
>>
>> The left of the graph is 2.18.1; the right is 2.19.2. I inadvertently
>> reinstated a previously set 40GB memory limit and updated the replicaset to
>> increase it back to 50GB -- this is the reason for the second Prometheus
>> restart and the slightly higher plateau for the last two OOMs.
>>
>> Unless there is a way to move some WAL segments out and then restore them
>> later, I'll try to delete the last 50 minutes' worth of segments to get the
>> pod to come up.
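>>
>> For reference, here is the rough kind of thing I have in mind -- just a
>> sketch, with the data path and segment count as assumptions for
>> illustration:
>>
>>   import os
>>   import shutil
>>
>>   WAL_DIR = "/prometheus/wal"          # assumed Prometheus data path
>>   ASIDE_DIR = "/prometheus/wal-aside"  # temporary holding area
>>   MOVE_NEWEST = 25                     # assumed number of newest segments
>>
>>   os.makedirs(ASIDE_DIR, exist_ok=True)
>>
>>   # WAL segments are the purely numeric files; checkpoint.* dirs are skipped.
>>   segments = sorted(f for f in os.listdir(WAL_DIR) if f.isdigit())
>>
>>   # Move the newest segments out of the way so replay stops earlier;
>>   # moving them back later should restore at least some of the data.
>>   for name in segments[-MOVE_NEWEST:]:
>>       shutil.move(os.path.join(WAL_DIR, name), os.path.join(ASIDE_DIR, name))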
>>
>> Thanks,
>> Vik
>>
>> On Wed, 1 Jul 2020 at 16:39, Viktor Radnai <[email protected]>
>> wrote:
>>
>>> Hi Ben,
>>>
>>> We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves
>>> the problem. I currently have one of the two replicas in production
>>> crashlooping so I'll try to roll this out in the next few hours and report
>>> back.
>>>
>>> Thanks,
>>> Vik
>>>
>>> On Wed, 1 Jul 2020 at 16:32, Ben Kochie <[email protected]> wrote:
>>>
>>>> What version of Prometheus do you have deployed? We've made several
>>>> major improvements to WAL handling and startup in the last couple of
>>>> releases.
>>>>
>>>> I would recommend upgrading to 2.19.2 if you haven't.
>>>>
>>>> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We have a recurring problem with Prometheus repeatedly getting
>>>>> OOMKilled on startup while trying to process the write-ahead log. I looked
>>>>> through the GitHub issues but could not find a solution or a currently
>>>>> open issue for this.
>>>>>
>>>>> We are running on Kubernetes in GKE via the prometheus-operator Helm
>>>>> chart, on Google Cloud's preemptible VMs. These VMs get killed after at
>>>>> most 24 hours, so our Prometheus pods also get killed and automatically
>>>>> migrated by Kubernetes (the data is on a persistent volume, of course). To
>>>>> avoid loss of metrics, we run two identically configured replicas with
>>>>> their own storage, scraping the same targets.
>>>>>
>>>>> We monitor numerous GCE VMs that do batch processing, running anywhere
>>>>> from a few minutes to several hours. This workload is bursty, fluctuating
>>>>> between tens and hundreds of active VMs at any time, so sometimes the
>>>>> Prometheus WAL folder grows to between 10 and 15GB. Prometheus usually
>>>>> handles this workload with about half a CPU core and 8GB of RAM, and if
>>>>> left to its own devices, the WAL folder shrinks again when the load
>>>>> decreases.
>>>>>
>>>>> The problem is that when there is a backlog and Prometheus is
>>>>> restarted (due to the preemptible VM going away), it uses several times
>>>>> more RAM to recover the WAL folder. This often exhausts all the available
>>>>> memory on the Kubernetes worker, so Prometheus is killed by the OOM killer
>>>>> over and over again, until I log in and delete the WAL folder, losing
>>>>> several hours of metrics. I have already doubled the size of the VMs just
>>>>> to accommodate Prometheus and I am reluctant to do this again. Running
>>>>> non-preemptible VMs would triple the cost of these instances, and
>>>>> Prometheus might still get restarted when we roll out an update -- so this
>>>>> would probably not solve the issue properly either.
>>>>>
>>>>> I don't know if there is something special in our use case, but I did
>>>>> come across a blog describing the same high memory usage behaviour on
>>>>> startup.
>>>>>
>>>>> I feel that unless there is a fix I can apply, this would warrant either
>>>>> a bug report or a feature request -- Prometheus should be able to recover
>>>>> without operator intervention or losing metrics. And for a process running
>>>>> on Kubernetes, we should be able to set memory "request" and "limit"
>>>>> values that are close to actual expected usage, rather than 3-4 times the
>>>>> steady-state usage just to accommodate the memory requirements of the
>>>>> startup phase.
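>>>>>
>>>>> To illustrate the last point, here is a rough sketch of the kind of
>>>>> setting I mean (the object name, namespace, and numbers below are
>>>>> placeholders for our setup, not anything official):
>>>>>
>>>>>   from kubernetes import client, config
>>>>>
>>>>>   config.load_kube_config()
>>>>>   apps = client.AppsV1Api()
>>>>>
>>>>>   # Placeholder names for the workload running our Prometheus pods.
>>>>>   NAME, NAMESPACE = "prometheus-server", "monitoring"
>>>>>
>>>>>   # Ideally request/limit could sit close to steady-state usage (~8GB),
>>>>>   # instead of being inflated 3-4x just to survive WAL replay on startup.
>>>>>   patch = {"spec": {"template": {"spec": {"containers": [{
>>>>>       "name": "prometheus",
>>>>>       "resources": {"requests": {"memory": "9Gi"},
>>>>>                     "limits": {"memory": "10Gi"}},
>>>>>   }]}}}}
>>>>>
>>>>>   apps.patch_namespaced_stateful_set(NAME, NAMESPACE, patch)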
>>>>>
>>>>> Please let me know what information I should provide, if any. I have
>>>>> some graph screenshots that would be relevant.
>>>>>
>>>>> Many thanks,
>>>>> Vik
>>>>>
>>>>
>>>
>>
>>
>

-- 
My other sig is hilarious
