[ 
https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad Willard updated SPARK-5075:
--------------------------------
    Labels: ec2 json memory-leak memory_leak parquet pyspark repartition s3  
(was: ec2 json parquet pyspark repartition s3)

> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5075
>                 URL: https://issues.apache.org/jira/browse/SPARK-5075
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.2.0
>         Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge
>            Reporter: Brad Willard
>              Labels: ec2, json, memory-leak, memory_leak, parquet, pyspark, 
> repartition, s3
>
> I'm trying to repartition a JSON dataset for better CPU utilization and save 
> it in Parquet format for better performance. The JSON dataset is about 200 GB.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> rdd = sql_context.jsonFile('s3c://some_path')
> rdd = rdd.repartition(256)
> rdd.saveAsParquetFile('hdfs://some_path')
> In Ganglia, when the dataset first loads it occupies about 200 GB of memory, 
> which is expected. However, once the repartition starts, memory usage balloons 
> to more than 2.5x that size and is never released, so any subsequent 
> operations fail with memory errors.
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
> I'm also seeing similar memory-leak behavior when running repeated queries 
> on a dataset. For example:
> rdd = sql_context.parquetFile('hdfs://some_path')
> rdd.registerTempTable('events')
> sql_context.sql(  anything  )
> sql_context.sql(  anything  )
> sql_context.sql(  anything  )
> sql_context.sql(  anything  )
> This results in the following memory usage pattern:
> http://cl.ly/image/180y2D3d1A0X
> It seems that intermediate results are not being garbage collected. 
> Eventually I have to kill my session to keep running queries.
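
To put the reported numbers in perspective, a rough back-of-the-envelope check can compare the ballooned footprint against the cluster's total RAM. The sketch below is illustrative only. It assumes roughly 60 GB of RAM per c3.8xlarge node (taken from the published instance spec, not from the report), treats the reported 200 GB and 2.5x figures as exact, and ignores OS and executor overhead. The function name `memory_budget` is hypothetical.

```python
# Rough memory budget for the cluster described in the report.
# Assumption (not from the report): each c3.8xlarge node has ~60 GB of RAM.
NODES = 10                # from the report: 10-node cluster
RAM_PER_NODE_GB = 60      # assumed instance spec
DATASET_GB = 200          # reported in-memory size after the initial load
GROWTH_FACTOR = 2.5       # reported lower bound on growth during repartition


def memory_budget(nodes, ram_per_node_gb, dataset_gb, growth_factor):
    """Return (cluster_total_gb, peak_gb, headroom_gb) for the given inputs."""
    cluster_total = nodes * ram_per_node_gb
    peak = dataset_gb * growth_factor
    return cluster_total, peak, cluster_total - peak


total, peak, headroom = memory_budget(
    NODES, RAM_PER_NODE_GB, DATASET_GB, GROWTH_FACTOR
)
print(f"cluster RAM: {total} GB, peak: {peak} GB, headroom: {headroom} GB")
```

Under these assumptions the unreleased footprint alone would consume most of the cluster's memory, which is consistent with later operations failing.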



