If that is your loop unrolled, then you are not doing parts of the work at a time: because evaluation is lazy, all of the operations execute in one go when the write finally happens. That's OK, but it may be part of the problem. For example, if you are filtering for a subset, processing, and unioning, that is just a harder and slower way of applying the same transformation to all of the data at once.
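A rough sketch of the contrast in PySpark (the 'day' column, the paths, and transform() are made up for illustration):

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("hdfs://somehdfslocation", header=True)

    def transform(d: DataFrame) -> DataFrame:
        return d  # stand-in for the real processing

    # Unrolled-loop shape: filter a subset, process it, union the pieces.
    # Evaluation is lazy, so all of this still runs in one go at the write;
    # the loop only adds plan complexity.
    days = ["2022-01-01", "2022-01-02"]  # illustrative
    out = reduce(DataFrame.union, [transform(df.filter(df["day"] == d)) for d in days])

    # Usually simpler and faster: the same transformation over all the data at once.
    out = transform(df)

    out.write.mode("overwrite").parquet("hdfs://some/made-up/output")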
On Thu, Mar 31, 2022 at 3:30 AM Joris Billen <joris.bil...@bigindustries.be> wrote:

> Thanks for the reply :-)
>
> I am using pyspark. Basically my code (simplified) is:
>
> df = spark.read.csv("hdfs://somehdfslocation")
> df1 = spark.sql(complex statement using df)
> ...
> dfx = spark.sql(complex statement using dfx-1)
> ...
> dfx15.write()
>
> What exactly is meant by "closing resources"? Is it just unpersisting cached dataframes at the end and stopping the spark context explicitly: sc.stop()?
>
> For processing many years at once versus a chunk in a loop: I see that if I go up to a certain number of days, one iteration will start to have tasks that fail. So I only take a limited number of days and run this process several times. Isn't this normal, since you are always somehow limited in terms of resources (I have 9 nodes with 32GB)? Or could you, in theory, process any volume if you wait long enough? I guess Spark can only break the work down into tasks up to a certain level (based on the partitions of the datasets and the intermediate results), and at some point you hit the limit where your resources are no longer sufficient to process one such task. Maybe you can tweak it a bit, but in the end you'll hit a limit?
>
> Concretely, the following topics would be interesting to find out more about (links):
> - where to see what you are still consuming after a spark job ends, if you didn't close resources
> - memory leaks in pyspark
> - a good article about closing resources (you find tons of snippets on how to start a spark context + config for number/cores/memory of workers/executors etc., but I never saw a focus on making sure you clean up -> or is it just stopping the spark context?)
>
> On 30 Mar 2022, at 21:24, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
> It's quite impossible for anyone to answer your question about what is eating your memory without even knowing what language you are using.
>
> If you are using C, then it's always pointers; that's the mem issue.
> If you are using Python, it can be something like not using a context manager; see Context Managers and Python's with Statement <https://realpython.com/python-with-statement/>.
>
> Another cause can be not closing resources after use.
>
> In my experience you can process 3 years or more of data, IF you are closing opened resources.
> I use the web GUI http://spark:4040 to follow what Spark is doing.
>
> On Wed, 30 Mar 2022 at 17:41, Joris Billen <joris.bil...@bigindustries.be> wrote:
>
>> Thanks for the answer, much appreciated! This forum is very useful :-)
>>
>> I didn't know the sparkcontext stays alive. I guess this is eating up memory. The eviction means that it knows it should clear some of the old cached memory to be able to store new data. In case anyone has good articles about memory leaks, I would be interested to read them.
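(On the "closing resources" question quoted above: in PySpark it mostly comes down to unpersisting what you cached and stopping the session yourself, since neither happens on its own while the context is alive. A minimal sketch, reusing the made-up path from the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("hdfs://somehdfslocation")  # made-up path from the example above
    df.cache()
    try:
        df.count()                      # stands in for the heavy processing
    finally:
        df.unpersist()                  # drop this DataFrame from executor memory
        spark.catalog.clearCache()      # or drop everything cached at once
        spark.stop()                    # tears down the context and its executors

The try/finally makes sure the cleanup runs even if the processing fails partway.)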
>> I will try to add the following lines at the end of my job (as I cached the table in Spark SQL):
>>
>> sqlContext.sql("UNCACHE TABLE mytableofinterest")
>> spark.stop()
>>
>> Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect? I have to break it down into smaller pieces and run those in a loop (1 day is already a lot of data).
>>
>> Thanks!
>>
>> On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> wrote:
>>
>> The Spark context does not stop when a job does. It stops when you stop it. There could be many ways mem can leak. Caching, maybe - but it will evict. You should be clearing caches when they are no longer needed.
>>
>> I would guess it is something else your program holds on to in its logic.
>>
>> Also consider not looping; there is probably a faster way to do it in one go.
>>
>> On Wed, Mar 30, 2022, 10:16 AM Joris Billen <joris.bil...@bigindustries.be> wrote:
>>
>>> Hi,
>>> I have a pyspark job submitted through spark-submit that does some heavy processing for 1 day of data. It runs with no errors. I have to loop over many days, so I run this spark job in a loop. I notice that after a couple of executions the memory usage is increasing on all worker nodes, and eventually this leads to failures. My job does some caching, but I understand that when the job ends successfully, the sparkcontext is destroyed and the cache should be cleared. However, it seems that something keeps filling the memory a bit more after each run. This is the memory behaviour over time, which in the end starts leading to failures:
>>>
>>> (what we see is: green = physical memory used, green-blue = physical memory cached, grey = memory capacity, a straight line around 31GB)
>>>
>>> This runs on a healthy Spark 2.4 cluster and was already optimized to a stable job in terms of spark-submit resource parameters (driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait).
>>> Any clue how to "really" clear the memory in between jobs? Currently I can loop 10x and then need to restart my cluster so all memory is cleared completely.
>>>
>>> Thanks for any info!
>>>
>>> <Screenshot 2022-03-30 at 15.28.24.png>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
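P.S. If you do keep a day-by-day loop inside one application, unpersist each chunk before starting the next one instead of restarting the cluster between runs. A rough sketch (the 'day' column, the output path, and process_one_day() are all hypothetical):

    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("hdfs://somehdfslocation", header=True)

    def process_one_day(d: DataFrame) -> DataFrame:
        return d  # hypothetical per-day transformation

    for day in ["2022-01-01", "2022-01-02"]:         # illustrative chunk list
        chunk = df.filter(df["day"] == day).cache()  # assumes a 'day' column exists
        process_one_day(chunk).write.mode("append").parquet("hdfs://some/output")
        chunk.unpersist()                            # free executor memory before the next chunk
    spark.stop()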