Tien-Dung LE created SPARK-12837:
------------------------------------

             Summary: Spark driver requires large memory space for serialized results even when no data is collected to the driver
                 Key: SPARK-12837
                 URL: https://issues.apache.org/jira/browse/SPARK-12837
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Tien-Dung LE


Executing a SQL statement with a large number of partitions requires a large 
amount of driver memory even when no data is collected back to the driver.

Here are the steps to reproduce the issue.
1. Start spark-shell with a low spark.driver.maxResultSize setting
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the following code
{code:scala}
// One million rows of a simple two-column case class
case class Toto(a: Int, b: Int)
val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF

// With 200 shuffle partitions the job completes
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
df.groupBy("a").count().saveAsParquetFile("toto1") // OK

// With 1,000 shuffle partitions (and explicit repartitioning) the job aborts
sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
{code}

The error message is
{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}
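
As a rough reading of the numbers above (a back-of-the-envelope estimate on my side, not part of the original report), each completed task seems to ship about 2.6 KB of serialized result metadata back to the driver, so the 1 MB cap is crossed after roughly 400 of the 1,000 tasks even though no rows are collected:
{code:scala}
// Back-of-the-envelope arithmetic derived from the error message above
// (assumption: the reported total is spread evenly across the reported tasks).
val reportedTotalKB = 1025.9                          // serialized results when the job aborted
val reportedTasks   = 393                             // tasks completed at that point
val perTaskKB       = reportedTotalKB / reportedTasks // ~2.6 KB per task
val allTasksKB      = perTaskKB * 1000                // ~2.6 MB for 1,000 tasks, well above the 1 MB limit
{code}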




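A possible workaround (a suggestion on my side, not something stated in this report) is to raise spark.driver.maxResultSize, or set it to 0 for no limit. The property is read when the driver starts, so it has to be passed on the command line (e.g. --conf spark.driver.maxResultSize=0 in step 1 above) or set on the SparkConf of an application, rather than through sqlContext.setConf:
{code:scala}
// Sketch for a standalone application; the app name is just a placeholder.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("maxResultSizeWorkaround")   // hypothetical name
  .set("spark.driver.maxResultSize", "0")  // 0 removes the limit; a larger bound such as "2g" also works
val sc = new SparkContext(conf)
{code}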
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
