[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brad Willard updated SPARK-5075:
--------------------------------
Summary: Memory Leak when repartitioning SchemaRDD or running queries in
general (was: Memory Leak when repartitioning SchemaRDD from JSON)
> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
> Key: SPARK-5075
> URL: https://issues.apache.org/jira/browse/SPARK-5075
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.2.0
> Environment: spark-ec2-launched 10-node cluster of c3.8xlarge instances
> Reporter: Brad Willard
> Labels: ec2, json, parquet, pyspark, repartition, s3
>
> I'm trying to repartition a JSON dataset for better CPU utilization and save
> it in Parquet format for better performance. The JSON dataset is about 200 GB.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> # Load the ~200 GB JSON dataset from S3
> rdd = sql_context.jsonFile('s3c://some_path')
> # Repartition for better parallelism, then write out as Parquet
> rdd = rdd.repartition(256)
> rdd.saveAsParquetFile('hdfs://some_path')
> In Ganglia, when the dataset first loads it occupies about 200 GB of memory,
> which is expected. However, once the repartition starts, memory usage balloons
> to over 2.5x that amount and is never released, so any subsequent operations
> fail with memory errors.
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
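> A possible workaround (a minimal sketch, not verified at this scale) is to
> set the partition count when the raw text is first read, so that no separate
> repartition step is needed. textFile's minPartitions argument and
> SQLContext.jsonRDD are both available in Spark 1.2; the paths below reuse
> the placeholders from the report.
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> # Read the raw JSON lines with the desired partition count up front,
> # avoiding the extra in-memory copy that repartition() appears to create
> raw = sc.textFile('s3c://some_path', minPartitions=256)
> # Infer the schema from the already-partitioned RDD of JSON strings
> rdd = sql_context.jsonRDD(raw)
> rdd.saveAsParquetFile('hdfs://some_path')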