Additionally, by modularization I also meant that jobs that have nothing to do with each other should live in separate Python programs.
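For example, a minimal sketch of one such standalone program (hypothetical file name, paths, and transformation), submitted once per dataset so that each run's JVM exits and releases all of its memory:

    # etl_job.py -- hypothetical standalone driver, run once per dataset:
    #   spark-submit etl_job.py /data/in/ds1 /data/out/ds1
    import sys

    from pyspark.sql import SparkSession

    def main(in_path: str, out_path: str) -> None:
        spark = SparkSession.builder.appName("etl").getOrCreate()
        df = spark.read.parquet(in_path)                  # 1. read into a dataframe
        result = df.groupBy("key").count()                # 2. placeholder join/filter/aggregation
        result.write.mode("overwrite").parquet(out_path)  # 3. write in parquet format
        spark.stop()

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])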
> On 5. Jun 2018, at 04:50, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> Disclaimer - I use Spark with Scala, not Python.
>
> But I am guessing that Jorn's reference to modularization means doing the processing inside methods/functions and calling those methods sequentially.
> I believe that as long as an RDD/dataset variable is in scope, its memory may not get released.
> If the work is done inside functions, the variables go out of scope when each function returns, and their memory can be released (see the sketch below).
>
> This also assumes the variables are not daisy-chained/inter-related, as that too will make releasing memory harder.
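A minimal sketch of that function-scope pattern (hypothetical paths and column names; assumes an existing SparkSession):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process_one(in_path: str, out_path: str) -> None:
        # All DataFrame references are local to this function, so they go
        # out of scope when it returns and Python can release them.
        df = spark.read.parquet(in_path)
        result = df.filter("amount > 0").groupBy("region").count()
        result.write.mode("overwrite").parquet(out_path)

    for name in ("ds1", "ds2", "ds3"):  # hypothetical dataset names
        process_one(f"/data/in/{name}", f"/data/out/{name}")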
> From: Jay <jayadeep.jayara...@gmail.com>
> Date: Monday, June 4, 2018 at 9:41 PM
> To: Shuporno Choudhury <shuporno.choudh...@gmail.com>
> Cc: "Jörn Franke [via Apache Spark User List]" <ml+s1001560n32458...@n3.nabble.com>, <user@spark.apache.org>
> Subject: Re: [PySpark] Releasing memory after a spark job is finished
>
> Can you tell us which version of Spark you are using and whether Dynamic Allocation is enabled?
>
> Also, how are the files being read? Is it a single read of all files using a file-matching regex, or are you running different threads in the same pyspark job?
>
> On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury <shuporno.choudh...@gmail.com> wrote:
> Thanks a lot for the insight.
> Actually I have exactly the same transformations for all the datasets, hence only one Python program.
> Given that the transformations are identical, do you suggest I run a separate spark-submit for each dataset?
>
> On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List] <ml+s1001560n32458...@n3.nabble.com> wrote:
> Yes, if they are independent, with different transformations, then I would create a separate Python program for each. Especially with big-data processing frameworks, one should avoid putting everything into one big monolithic application.
>
> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
>
> Hi,
>
> Thanks for the input.
> I was trying to get the functionality working first, hence local mode; I will definitely run on a cluster later.
>
> Sorry for my naivety, but can you please elaborate on the modularity concept you mentioned and how it affects what I am already doing?
> By 'an independent python program for each process', do you mean running a separate spark-submit for each dataset?
>
> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <[hidden email]> wrote:
> Why don't you modularize your code and write an independent Python program for each process, submitted via Spark?
>
> Not sure, though, whether Spark local mode makes sense here. If you don't have a cluster, a normal Python program can be much better.
>
> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>
> Hi everyone,
>
> I am trying to run a pyspark job on several data sets sequentially [basically: 1. read the data into a dataframe, 2. perform some join/filter/aggregation, 3. write the modified data in parquet format to a target location].
> While running this code across multiple independent data sets sequentially, the memory from the previous data set doesn't seem to get released/cleared, so Spark's memory consumption (JVM memory as shown in Task Manager) keeps growing until the job fails on some data set.
> So, is there a way to clear/remove dataframes that I know will not be used later?
> Basically, can I clear out some memory programmatically (in the pyspark code) when processing of a particular data set ends?
> At no point am I caching any dataframe, so unpersist() is not a solution either.
>
> I am running Spark with local[*] as master, and a single SparkSession does all the processing.
> If it is not possible to clear out memory, what would be a better approach to this problem?
>
> Can someone please help me with this and tell me if I am going wrong anywhere?
>
> --Thanks,
> Shuporno Choudhury
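For completeness, the programmatic cleanup available inside a single SparkSession looks roughly like the sketch below (hypothetical paths and transformation; none of these calls is guaranteed to shrink the JVM heap reported by Task Manager, which is what motivates the separate-program approach above):

    import gc

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sequential-etl").getOrCreate()

    for name in ("ds1", "ds2", "ds3"):  # hypothetical dataset names
        df = spark.read.parquet(f"/data/in/{name}")
        result = df.groupBy("key").count()  # placeholder transformation
        result.write.mode("overwrite").parquet(f"/data/out/{name}")

        del df, result              # drop the Python references
        spark.catalog.clearCache()  # no-op when nothing is cached, but harmless
        gc.collect()                # ask Python to collect the dropped references

    spark.stop()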