[
https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brad Willard updated SPARK-5075:
--------------------------------
Description:
I'm trying to repartition a JSON dataset for better CPU utilization and save it
in Parquet format for better performance. The JSON dataset is about 200 GB:
from pyspark.sql import SQLContext

sql_context = SQLContext(sc)

# load the ~200 GB JSON dataset from S3
rdd = sql_context.jsonFile('s3c://some_path')

# repartition and save as Parquet on HDFS
rdd = rdd.repartition(256)
rdd.saveAsParquetFile('hdfs://some_path')
In Ganglia, the dataset occupies about 200 GB of memory when it first loads,
which is expected. However, once the repartition starts, memory usage balloons
to more than 2.5x that and is never released, so any subsequent operations fail
with memory errors.
https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
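For reference, here is the kind of explicit cleanup I would expect to release that
memory after the write. This is just a sketch, assuming the extra usage comes from
RDD blocks Spark is still tracking; unpersist() is the standard RDD API for
dropping them:

from pyspark.sql import SQLContext

sql_context = SQLContext(sc)

json_rdd = sql_context.jsonFile('s3c://some_path')   # ~200 GB of JSON
repartitioned = json_rdd.repartition(256)
repartitioned.saveAsParquetFile('hdfs://some_path')

# mark both SchemaRDDs as non-persistent and drop any blocks held for them;
# in theory this should return the memory reported in Ganglia
json_rdd.unpersist()
repartitioned.unpersist()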
I'm also seeing similar memory-leak behavior when running repeated queries on a
dataset. For example, this sequence:
rdd = sql_context.parquetFile('hdfs://some_path')
rdd.registerTempTable('events')

# any sequence of queries
sql_context.sql(anything)
sql_context.sql(anything)
sql_context.sql(anything)
sql_context.sql(anything)
results in a memory usage pattern like this:
http://cl.ly/image/180y2D3d1A0X
It seems like intermediate results are not being garbage collected. Eventually
I have to kill my session to keep running queries.
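For what it's worth, below is a sketch of the explicit cache management I would
expect to keep memory flat between queries, assuming the growth comes from cached
table data rather than the queries themselves (the count(*) queries are just
placeholders for my real ones):

from pyspark.sql import SQLContext

sql_context = SQLContext(sc)

rdd = sql_context.parquetFile('hdfs://some_path')
rdd.registerTempTable('events')

sql_context.cacheTable('events')

# placeholder queries standing in for the real workload
sql_context.sql('SELECT count(*) FROM events').collect()
sql_context.sql('SELECT count(*) FROM events').collect()

# in theory this drops the cached blocks for the table and returns the memory
sql_context.uncacheTable('events')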
> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
> Key: SPARK-5075
> URL: https://issues.apache.org/jira/browse/SPARK-5075
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.2.0
> Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge
> Reporter: Brad Willard
> Labels: ec2, json, parquet, pyspark, repartition, s3
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)