Thanks Ben. I think sharding may be the way to go, most likely using Thanos.
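For reference, one common way to shard scrape targets across several Prometheus servers (which Thanos can then present behind a single query interface) is hashmod relabelling. A minimal sketch, assuming two shards and that `__address__` is a stable sharding key; the job name is illustrative:

```yaml
# Hypothetical scrape_config fragment for shard 0 of 2.
# Each shard keeps only the targets whose address hashes to its number.
scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"          # the second shard uses "1" here
        action: keep
```

Each shard then holds a disjoint subset of the series, so both steady-state and WAL-replay memory shrink roughly in proportion to the shard count.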
I would like to continue offering a single interface for all of production. On the few hundred VMs we scrape (which cause our problem), we only run the node and process exporters. I may stop collecting some metrics, but probably not enough to make a difference. Perhaps there are too many labels on the data (we collect most GCE labels and metadata, and the VMs are relatively short-lived). I have already doubled the Kubernetes node size, and it would be wasteful to do so again just to satisfy a memory spike that occurs only once or twice a week, for up to an hour.

Anyway, if there isn't a feature request open already to limit Prometheus's memory usage on startup, then I would like to open one. What I have in mind is being able to set a limit (similar to Java's -Xmx flag) which closely matches the memory limit set on the Prometheus pod. If Prometheus could stay below this limit during startup then all would be well: it could be scheduled efficiently by Kubernetes, or the user could allocate an appropriately sized VM. If Prometheus is OOMKilled or exits with an error after startup, while collecting metrics, then the limit is simply too small. This would be the ideal solution, but Matthias's proposal would probably help as well.

I understand that adding more memory would solve the problem, but an infinite OOM loop during startup, caused by WAL recovery using approximately 3-5 times the steady-state memory, is difficult to accommodate. It really leaves two choices: 1. allocating much more RAM than needed, or 2. accepting the loss of the WAL and the last 6 hours of metrics every time this issue occurs. The same problem would also occur if I ran Prometheus inside VMs: I would need a VM that costs 4 times as much as necessary for normal operation.
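For context, this is roughly the shape of the constraint today; a hypothetical sketch of the pod's resource settings (the numbers are illustrative, not our exact values):

```yaml
# Illustrative Prometheus container resources. Steady state is ~8GB,
# but the limit has to be sized for the 3-5x WAL-replay spike instead.
resources:
  requests:
    memory: 10Gi
  limits:
    memory: 40Gi   # could shrink toward the request if a startup memory cap existed
```

With a startup cap like the one proposed above, request and limit could sit close together and the scheduler could pack nodes far more efficiently.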
At least on Kubernetes the excess capacity may be used up by the cluster, although currently we already have a smaller number of larger nodes than is ideal for our workload, just because of Prometheus.

Would this feature request make sense? Is it even remotely feasible?

Many thanks,
Vik

On Wed, 8 Jul 2020 at 20:06, Ben Kochie <[email protected]> wrote:

> * Get a bigger server
> * Reduce the number of metrics you collect
> * Shard your server
> Probably some combination of all of these.
>
> On Wed, Jul 8, 2020 at 8:21 PM Viktor Radnai <[email protected]> wrote:
>
>> Hi Ben, Julien and all,
>>
>> To follow up on my issue from last week, the OOM loop does occur even
>> with Prometheus 2.19.2.
>>
>> This time around the instance has just enough memory to complete WAL
>> replay, but it OOMs immediately after that; this could be an improvement
>> or just a coincidence. The WAL folder is about 16GB and the OOM occurs at
>> around 43GB (due to the Kubernetes worker running out of memory). Anything
>> else I could try?
>>
>> Thanks,
>> Vik
>>
>> On Wed, 1 Jul 2020 at 19:10, Viktor Radnai <[email protected]> wrote:
>>
>>> Hi Julien,
>>>
>>> Thanks for clarifying that. In that case I'll see if the issue recurs
>>> with 2.19.2 in the next few weeks.
>>>
>>> Vik
>>>
>>> On Wed, 1 Jul 2020 at 19:08, Julien Pivotto <[email protected]> wrote:
>>>
>>>> When 2.19 runs it will create an mmapped head, which will improve
>>>> that.
>>>>
>>>> I agree that starting 2.19 with a 2.18 WAL won't make a change.
>>>>
>>>> On Wed, 1 Jul 2020 at 19:55, Viktor Radnai <[email protected]> wrote:
>>>>
>>>>> Hi again Ben,
>>>>>
>>>>> Unfortunately upgrading to 2.19.2 does not solve the startup issue.
>>>>> Prometheus gets OOMKilled before even starting to parse the last 25
>>>>> segments, which represent the last 50 minutes worth of data. Based on
>>>>> this, the estimated memory requirement should be somewhere between
>>>>> 60-70GB, but the worker node only has 52GB.
>>>>> The other Prometheus pod currently consumes 7.7GB.
>>>>>
>>>>> The left of the graph is 2.18.1, the right is 2.19.2. I inadvertently
>>>>> reinstated a previously set 40GB memory limit and updated the
>>>>> replicaset to increase it back to 50GB; this is the reason for the
>>>>> second Prometheus restart and the slightly higher plateau for the last
>>>>> two OOMs.
>>>>>
>>>>> Unless there is a way to move some WAL segments out and then restore
>>>>> them later, I'll try to delete the last 50 minutes worth of segments
>>>>> to get the pod to come up.
>>>>>
>>>>> Thanks,
>>>>> Vik
>>>>>
>>>>> On Wed, 1 Jul 2020 at 16:39, Viktor Radnai <[email protected]> wrote:
>>>>>
>>>>>> Hi Ben,
>>>>>>
>>>>>> We are running 2.18.1; I will upgrade to 2.19.2 and see if this
>>>>>> solves the problem. I currently have one of the two replicas in
>>>>>> production crashlooping, so I'll try to roll this out in the next
>>>>>> few hours and report back.
>>>>>>
>>>>>> Thanks,
>>>>>> Vik
>>>>>>
>>>>>> On Wed, 1 Jul 2020 at 16:32, Ben Kochie <[email protected]> wrote:
>>>>>>
>>>>>>> What version of Prometheus do you have deployed? We've made several
>>>>>>> major improvements to WAL handling and startup in the last couple
>>>>>>> of releases.
>>>>>>>
>>>>>>> I would recommend upgrading to 2.19.2 if you haven't.
>>>>>>>
>>>>>>> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have a recurring problem with Prometheus repeatedly getting
>>>>>>>> OOMKilled on startup while trying to process the write-ahead log.
>>>>>>>> I tried to look through GitHub issues but there was no solution or
>>>>>>>> currently open issue as far as I could see.
>>>>>>>>
>>>>>>>> We are running on Kubernetes in GKE using the prometheus-operator
>>>>>>>> Helm chart, on Google Cloud's Preemptible VMs.
>>>>>>>> These VMs get killed after at most 24 hours, so our Prometheus
>>>>>>>> pods also get killed and automatically migrated by Kubernetes (the
>>>>>>>> data is on a persistent volume, of course). To avoid loss of
>>>>>>>> metrics, we run two identically configured replicas with their own
>>>>>>>> storage, scraping all the same targets.
>>>>>>>>
>>>>>>>> We monitor numerous GCE VMs that do batch processing, running
>>>>>>>> anywhere between a few minutes and several hours. This workload is
>>>>>>>> bursty, fluctuating between tens and hundreds of VMs active at any
>>>>>>>> time, so sometimes the Prometheus wal folder grows to 10-15GB in
>>>>>>>> size. Prometheus usually handles this workload with about half a
>>>>>>>> CPU core and 8GB of RAM, and if left to its own devices, the wal
>>>>>>>> folder will shrink again when the load decreases.
>>>>>>>>
>>>>>>>> The problem is that when there is a backlog and Prometheus is
>>>>>>>> restarted (due to the preemptible VM going away), it will use
>>>>>>>> several times more RAM to recover the wal folder. This often
>>>>>>>> exhausts all the available memory on the Kubernetes worker, so
>>>>>>>> Prometheus is killed by the OOM killer over and over again, until I
>>>>>>>> log in and delete the wal folder, losing several hours of metrics.
>>>>>>>> I have already doubled the size of the VMs just to accommodate
>>>>>>>> Prometheus and I am reluctant to do this again. Running
>>>>>>>> non-preemptible VMs would triple the cost of these instances, and
>>>>>>>> Prometheus might still get restarted when we roll out an update, so
>>>>>>>> this would probably not even solve the issue properly.
>>>>>>>>
>>>>>>>> I don't know if there is something special in our use case, but I
>>>>>>>> did come across a blog describing the same high memory usage
>>>>>>>> behaviour on startup.
>>>>>>>> I feel that unless there is a fix I can apply, this would warrant
>>>>>>>> either a bug report or a feature request: Prometheus should be able
>>>>>>>> to recover without operator intervention or losing metrics. And for
>>>>>>>> a process running on Kubernetes, we should be able to set memory
>>>>>>>> "request" and "limit" values that are close to actual expected
>>>>>>>> usage, rather than 3-4 times the steady-state usage just to
>>>>>>>> accommodate the memory requirements of the startup phase.
>>>>>>>>
>>>>>>>> Please let me know what information I should provide, if any. I
>>>>>>>> have some graph screenshots that would be relevant.
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>> Vik
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Prometheus Users" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com.
>>>>>>
>>>>>> --
>>>>>> My other sig is hilarious

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CANx-tGjdTiJ30Dmzt8jvS6EqE%3DrUaO%3DzH%2B8Xv1JS4TQKYJ%2B-Og%40mail.gmail.com.
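For anyone finding this thread later, the WAL trimming discussed upthread can be sketched as below. This is hypothetical: the real path (e.g. the pod's data dir) and the "~2 minutes per segment" figure are assumptions taken from the thread, Prometheus must be stopped first, and removing segments discards their data. The demo builds a throwaway directory so it is safe to run as-is:

```shell
# Demo: fabricate a throwaway "WAL" directory to illustrate the idea
# safely. In reality this would be the Prometheus data dir (path is an
# assumption; stop Prometheus before touching it).
WAL_DIR=$(mktemp -d)
for i in $(seq -w 1 30); do touch "$WAL_DIR/000000$i"; done
touch "$WAL_DIR/checkpoint.000005"   # non-segment entries must be ignored

DROP=25  # newest segments to drop (~50 minutes at ~2 min/segment upthread)

# WAL segments have purely numeric names; sort so the newest come last.
segments=$(ls "$WAL_DIR" | grep -E '^[0-9]+$' | sort -n)
echo "total segments: $(echo "$segments" | wc -l)"

# Print rather than delete; swap 'echo'/'sed' for 'rm' only once losing
# the data in those segments is acceptable.
echo "$segments" | tail -n "$DROP" | sed "s|^|would remove $WAL_DIR/|"
```

Trimming from the newest end sacrifices the most recent ~50 minutes to let replay finish, which matches the fallback described upthread; deleting the whole wal folder loses everything since the last block was cut.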

