When 2.19 runs, it will create the mmapped head, which will improve that.
I agree that starting 2.19 with a 2.18 WAL won't make a difference.

On Wed, 1 Jul 2020 at 19:55, Viktor Radnai <[email protected]> wrote:

> Hi again Ben,
>
> Unfortunately, upgrading to 2.19.2 does not solve the startup issue.
> Prometheus gets OOMKilled before even starting to parse the last 25
> segments, which represent the last 50 minutes' worth of data. Based on
> this, the estimated memory requirement should be somewhere between
> 60-70GB, but the worker node only has 52GB. The other Prometheus pod
> currently consumes 7.7GB.
>
> The left of the graph is 2.18.1; the right is 2.19.2. I inadvertently
> reinstated a previously set 40GB memory limit and updated the replicaset
> to increase it back to 50GB -- this is the reason for the second
> Prometheus restart and the slightly higher plateau for the last two OOMs.
>
> Unless there is a way to move some WAL segments out and then restore
> them later, I'll try deleting the last 50 minutes' worth of segments to
> get the pod to come up.
>
> Thanks,
> Vik
>
> On Wed, 1 Jul 2020 at 16:39, Viktor Radnai <[email protected]> wrote:
>
>> Hi Ben,
>>
>> We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this
>> solves the problem. I currently have one of the two replicas in
>> production crashlooping, so I'll try to roll this out in the next few
>> hours and report back.
>>
>> Thanks,
>> Vik
>>
>> On Wed, 1 Jul 2020 at 16:32, Ben Kochie <[email protected]> wrote:
>>
>>> What version of Prometheus do you have deployed? We've made several
>>> major improvements to WAL handling and startup in the last couple of
>>> releases.
>>>
>>> I would recommend upgrading to 2.19.2 if you haven't.
>>>
>>> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have a recurring problem with Prometheus repeatedly getting
>>>> OOMKilled on startup while trying to process the write-ahead log.
>>>> I tried to look through GitHub issues, but there was no solution or
>>>> currently open issue as far as I could see.
>>>>
>>>> We are running on Kubernetes in GKE using the prometheus-operator
>>>> Helm chart, on Google Cloud's Preemptible VMs. These VMs get killed
>>>> after 24 hours at most, so our Prometheus pods also get killed and
>>>> automatically migrated by Kubernetes (the data is on a persistent
>>>> volume, of course). To avoid loss of metrics, we run two identically
>>>> configured replicas with their own storage, scraping all the same
>>>> targets.
>>>>
>>>> We monitor numerous GCE VMs that do batch processing, running
>>>> anywhere between a few minutes and several hours. This workload is
>>>> bursty, fluctuating between tens and hundreds of VMs active at any
>>>> time, so sometimes the Prometheus wal folder grows to 10-15GB in
>>>> size. Prometheus usually handles this workload with about half a CPU
>>>> core and 8GB of RAM, and if left to its own devices, the wal folder
>>>> will shrink again when the load decreases.
>>>>
>>>> The problem is that when there is a backlog and Prometheus is
>>>> restarted (due to the preemptible VM going away), it will use
>>>> several times more RAM to recover the wal folder. This often
>>>> exhausts all the available memory on the Kubernetes worker, so
>>>> Prometheus is killed by the OOM killer over and over again, until I
>>>> log in and delete the wal folder, losing several hours of metrics.
>>>> I have already doubled the size of the VMs just to accommodate
>>>> Prometheus, and I am reluctant to do this again. Running
>>>> non-preemptible VMs would triple the cost of these instances, and
>>>> Prometheus might still get restarted when we roll out an update --
>>>> so this would probably not even solve the issue properly.
>>>>
>>>> I don't know if there is something special about our use case, but
>>>> I did come across a blog describing the same high memory usage
>>>> behaviour on startup.
>>>>
>>>> I feel that unless there is a fix I can apply, this would warrant
>>>> either a bug or a feature request -- Prometheus should be able to
>>>> recover without operator intervention or losing metrics. And for a
>>>> process running on Kubernetes, we should be able to set memory
>>>> "request" and "limit" values that are close to actual expected
>>>> usage, rather than 3-4 times the steady-state usage just to
>>>> accommodate the memory requirements of the startup phase.
>>>>
>>>> Please let me know what information I should provide, if any. I have
>>>> some graph screenshots that would be relevant.
>>>>
>>>> Many thanks,
>>>> Vik
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Prometheus Users" group.
>>>> To unsubscribe from this group and stop receiving emails from it,
>>>> send an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>>>> .
>>>
>>
>> --
>> My other sig is hilarious
>
> --
> My other sig is hilarious
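Vik's idea of moving some WAL segments out and restoring them later could be sketched in shell. This is a hypothetical helper, not a Prometheus tool: the paths are assumptions, it relies on WAL segment files being zero-padded numeric names (leaving `checkpoint.*` directories alone), and Prometheus must be stopped while segments are moved.

```shell
#!/bin/sh
# park_wal_segments: move the newest N numeric WAL segment files from a
# WAL directory to a "parking" directory so Prometheus can start without
# replaying them. Sketch only -- verify paths and behaviour before use.
park_wal_segments() {
  wal_dir=$1
  park_dir=$2
  count=$3
  mkdir -p "$park_dir"
  # Segment files are zero-padded numbers; checkpoint.* directories are
  # excluded by the numeric-name filter.
  for seg in $(ls "$wal_dir" | grep -E '^[0-9]+$' | sort -n | tail -n "$count"); do
    mv "$wal_dir/$seg" "$park_dir/"
  done
}

# Example invocation (paths and segment count are assumptions):
# park_wal_segments /prometheus/wal /prometheus/wal-parked 25
```

Whether the parked segments can safely be moved back later is unclear, since a restarted Prometheus will write new segments reusing the same numbers; treat this as a way to get the pod up while accepting the likely loss of those samples.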

