Why don’t you modularize your code and write an independent Python program for each 
data set that you submit via spark-submit? Each submission runs in its own JVM, so the 
memory is released when that process exits.
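
For example, a minimal sketch (the script names, the paths and the filter/aggregation 
step are just placeholders for your actual logic): one spark-submit per data set, so 
each run gets a fresh JVM whose memory is released when the process exits.

# process_dataset.py -- independent PySpark job for one data set (hypothetical name)
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]
    spark = SparkSession.builder.appName("process " + input_path).getOrCreate()

    # 1. read data into a DataFrame (assuming Parquet input here)
    df = spark.read.parquet(input_path)

    # 2. perform some join/filter/aggregation (placeholder transformation)
    result = df.filter(df["value"].isNotNull()).groupBy("key").count()

    # 3. write the modified data in Parquet format to the target location
    result.write.mode("overwrite").parquet(output_path)

    spark.stop()  # the JVM memory goes away when this process exits

# submit_all.py -- drive one spark-submit per data set (paths are made up)
import subprocess

datasets = [("/data/in/ds1", "/data/out/ds1"),
            ("/data/in/ds2", "/data/out/ds2")]
for input_path, output_path in datasets:
    subprocess.run(
        ["spark-submit", "--master", "local[*]",
         "process_dataset.py", input_path, output_path],
        check=True,
    )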

I’m not sure, though, whether Spark in local mode makes sense here. If you don’t have a 
cluster, a plain Python program can be a much better fit.
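
If each data set fits in memory, a plain script per data set (sketched here with 
pandas, assuming Parquet in and out and the same placeholder key/value columns as 
above) avoids the JVM entirely:

# plain_process.py -- non-Spark alternative for one data set (hypothetical name)
import sys
import pandas as pd

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    df = pd.read_parquet(input_path)            # 1. read
    filtered = df.dropna(subset=["value"])      # 2. filter (placeholder)
    result = filtered.groupby("key").size().reset_index(name="count")
    result.to_parquet(output_path)              # 3. write Parquet
    # everything is freed when this process ends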

> On 4. Jun 2018, at 21:37, Shuporno Choudhury <shuporno.choudh...@gmail.com> 
> wrote:
> 
> Hi everyone,
> I am trying to run a PySpark job on some data sets sequentially [basically 
> 1. read data into a DataFrame, 2. perform some join/filter/aggregation, 3. write 
> the modified data in Parquet format to a target location]. 
> Now, while running this PySpark code across multiple independent data sets 
> sequentially, the memory used for the previous data set doesn't seem to get 
> released/cleared, and hence Spark's memory consumption (JVM memory usage from 
> Task Manager) keeps increasing until it fails on some data set.
> So, is there a way to clear/remove DataFrames that I know are not going to be 
> used later? 
> Basically, can I free up some memory programmatically (in the PySpark code) 
> when the processing of a particular data set ends? 
> At no point am I caching any DataFrame (so unpersist() is not a 
> solution either).
> 
> I am running Spark using local[*] as the master. There is a single SparkSession 
> that does all the processing. 
> If it is not possible to clear out memory, what would be a better approach 
> to this problem?
> 
> Can someone please help me with this and tell me if I am going wrong anywhere?
> 
> --Thanks,
> Shuporno Choudhury
