Additionally, what I meant by modularization is that jobs that have really nothing 
to do with each other should be in separate Python programs.
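
As a rough illustration (the script names, paths and dataset list below are made 
up, not from this thread), each data set could live in its own script and be 
launched one after another via spark-submit, so every run gets a fresh driver 
JVM whose memory is fully released when that process exits:

    # run_all.py - hypothetical driver script
    import subprocess

    datasets = ["/data/input/ds1", "/data/input/ds2"]   # made-up paths

    for path in datasets:
        # each spark-submit starts a separate driver process/JVM, so all of its
        # memory is returned to the OS when that process finishes
        subprocess.run(["spark-submit", "process_dataset.py", path], check=True)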

> On 5. Jun 2018, at 04:50, Thakrar, Jayesh <jthak...@conversantmedia.com> 
> wrote:
> 
> Disclaimer - I use Spark with Scala and not Python.
>  
> But I am guessing that Jorn's reference to modularization is to ensure that 
> you do the processing inside methods/functions and call those methods 
> sequentially.
> I believe that as long as an RDD/Dataset variable is in scope, its memory may 
> not be released.
> By moving the processing into functions, those variables go out of scope when 
> the function returns, and their memory can be released.
>  
> This also assumes that the variables are not daisy-chained/inter-related, as 
> that would keep references alive and make releasing the memory harder.
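>  
> For example, a minimal sketch of that pattern in PySpark (the paths and the 
> filter below are made-up stand-ins for the real transformations):
>  
>     # sketch only - the per-dataset work lives in a function, so the DataFrame
>     # references go out of scope and become collectable once it returns
>     from pyspark.sql import SparkSession
>  
>     def process_dataset(spark, in_path, out_path):
>         df = spark.read.parquet(in_path)
>         result = df.filter(df["value"] > 0)        # stand-in transformation
>         result.write.mode("overwrite").parquet(out_path)
>  
>     spark = SparkSession.builder.getOrCreate()
>     for in_path, out_path in [("/in/ds1", "/out/ds1"),
>                               ("/in/ds2", "/out/ds2")]:
>         process_dataset(spark, in_path, out_path)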
>  
>  
> From: Jay <jayadeep.jayara...@gmail.com>
> Date: Monday, June 4, 2018 at 9:41 PM
> To: Shuporno Choudhury <shuporno.choudh...@gmail.com>
> Cc: "Jörn Franke [via Apache Spark User List]" 
> <ml+s1001560n32458...@n3.nabble.com>, <user@spark.apache.org>
> Subject: Re: [PySpark] Releasing memory after a spark job is finished
>  
> Can you tell us what version of Spark you are using and whether Dynamic 
> Allocation is enabled?
>  
> Also, how are the files being read? Is it a single read of all files using a 
> file-matching regex, or are you running different threads in the same PySpark 
> job?
>  
>  
> 
> On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, 
> <shuporno.choudh...@gmail.com> wrote:
> Thanks a lot for the insight.
> Actually, I have the exact same transformations for all the datasets, hence 
> only one Python program.
> Now, do you suggest that I run a separate spark-submit for each dataset, given 
> that the transformations are exactly the same?
>  
> On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], 
> <ml+s1001560n32458...@n3.nabble.com> wrote:
> Yes, if they are independent, with different transformations, then I would 
> create a separate Python program for each. Especially with big data processing 
> frameworks, one should avoid putting everything into one big monolithic 
> application.
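>  
> As a rough illustration (the file name, arguments and transformation below are 
> made up), such a per-dataset program could look like this and be run with its 
> own spark-submit:
>  
>     # process_one_dataset.py - hypothetical standalone job
>     # e.g.: spark-submit process_one_dataset.py /in/ds1 /out/ds1
>     import sys
>     from pyspark.sql import SparkSession
>  
>     if __name__ == "__main__":
>         in_path, out_path = sys.argv[1], sys.argv[2]
>         spark = SparkSession.builder.appName("one-dataset").getOrCreate()
>         df = spark.read.parquet(in_path)
>         df.filter(df["value"] > 0).write.mode("overwrite").parquet(out_path)
>         spark.stop()   # everything is released when this process exits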
>  
> 
> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
> 
> Hi,
>  
> Thanks for the input.
> I was trying to get the functionality working first, hence I was using local 
> mode. I will definitely be running on a cluster, but later.
>  
> Sorry for my naivety, but can you please elaborate on the modularity concept 
> that you mentioned and how it will affect whatever I am already doing?
> Do you mean running a separate spark-submit for each dataset when you say 'an 
> independent python program for each process'?
>  
> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] 
> <[hidden email]> wrote:
> Why don’t you modularize your code and write an independent Python program for 
> each process that is submitted via Spark?
>  
> Not sure though if Spark local makes sense. If you don’t have a cluster, then a 
> normal Python program can be much better.
> 
> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
> 
> Hi everyone,
> I am trying to run a PySpark job on several data sets sequentially [basically: 
> 1. read the data into a DataFrame, 2. perform some join/filter/aggregation, 
> 3. write the modified data in Parquet format to a target location].
> Now, while running this PySpark code across multiple independent data sets 
> sequentially, the memory used by the previous data set doesn't seem to get 
> released/cleared, so Spark's memory consumption (JVM memory consumption as seen 
> in Task Manager) keeps increasing until it fails on some data set.
> So, is there a way to clear/remove dataframes that I know are not going to be 
> used later? 
> Basically, can I clear out some memory programmatically (in the PySpark code) 
> when processing for a particular data set ends?
> At no point am I caching any DataFrame (so unpersist() is also not a 
> solution).
>  
> I am running spark using local[*] as master. There is a single SparkSession 
> that is doing all the processing.
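>  
> For context, the structure is roughly the following (the paths and the 
> aggregation are made-up placeholders for the real logic):
>  
>     from pyspark.sql import SparkSession
>  
>     spark = SparkSession.builder.master("local[*]").getOrCreate()
>  
>     for in_path, out_path in [("/data/ds1", "/out/ds1"),
>                               ("/data/ds2", "/out/ds2")]:
>         df = spark.read.parquet(in_path)
>         result = df.groupBy("key").count()         # placeholder aggregation
>         result.write.mode("overwrite").parquet(out_path)
>         # memory used for this data set does not appear to be released here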
> If it is not possible to clear out memory, what can be a better approach for 
> this problem?
>  
> Can someone please help me with this and tell me if I am going wrong anywhere?
>  
> --Thanks,
> Shuporno Choudhury
>  
> 
> 
>  
> --
> --Thanks,
> Shuporno Choudhury
>  
> 
