[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brad Willard updated SPARK-5075:
--------------------------------
Summary: Memory Leak when repartitioning SchemaRDD or running queries in
general (was: Memory Leak when repartitioning SchemaRDD from JSON)
> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
> Key: SPARK-5075
> URL: https://issues.apache.org/jira/browse/SPARK-5075
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.2.0
> Environment: spark-ec2-launched 10-node cluster of c3.8xlarge instances
> Reporter: Brad Willard
> Labels: ec2, json, parquet, pyspark, repartition, s3
>
> I'm trying to repartition a JSON dataset for better CPU utilization and save
> it in Parquet format for better performance. The JSON dataset is about 200 GB.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> # Load the ~200 GB JSON dataset from S3
> rdd = sql_context.jsonFile('s3c://some_path')
> # Repartition for better parallelism, then write out as Parquet
> rdd = rdd.repartition(256)
> rdd.saveAsParquetFile('hdfs://some_path')
> In Ganglia, when the dataset first loads it occupies about 200 GB of memory,
> which is expected. However, once the repartition starts, memory usage balloons
> to over 2.5x that amount and is never released, so any subsequent operations
> fail with memory errors.
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
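> A possible workaround (a minimal sketch, not verified at this scale) is to
> set the partition count when the raw text is first read, so that no separate
> repartition step is needed. textFile's minPartitions argument and
> SQLContext.jsonRDD are both available in Spark 1.2; the paths below reuse
> the placeholders from the report.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> # Read the raw JSON lines with the desired partition count up front,
> # avoiding the extra in-memory copy that repartition() appears to create
> raw = sc.textFile('s3c://some_path', minPartitions=256)
> # Infer the schema from the already-partitioned RDD of JSON strings
> rdd = sql_context.jsonRDD(raw)
> rdd.saveAsParquetFile('hdfs://some_path')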