[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269132#comment-15269132 ]

Davies Liu commented on SPARK-12837:
------------------------------------

With spark.driver.maxResultSize=1m, this simple job fails:
{code}
>>> sc.range(0, 1000, 1, 1000).count()
Job aborted due to stage failure: Total size of serialized results of 370 tasks (1024.7 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
{code}
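
For reference, a self-contained sketch of the same reproduction (a rough illustration, assuming a local PySpark setup; the app name, local master, and the partition counts tried are illustrative, and the exact point at which the 1m limit is crossed depends on the environment and Spark version):
{code}
from pyspark import SparkConf, SparkContext

# Standalone sketch of the reproduction above; the local master and
# explicit 1m limit are just to make it runnable outside spark-shell.
conf = (SparkConf()
        .setAppName("SPARK-12837-repro")
        .setMaster("local[4]")
        .set("spark.driver.maxResultSize", "1m"))
sc = SparkContext(conf=conf)

# count() only sends one long per task back to the driver, yet the
# total serialized result size still grows with the number of tasks.
for partitions in (100, 500, 1000):
    try:
        sc.range(0, 1000, 1, partitions).count()
        print("%d partitions: OK" % partitions)
    except Exception as e:
        print("%d partitions: FAILED: %s" % (partitions, e))
{code}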

That means the average size of a serialized task result is about 2.8 KB (1024.7 KB / 370 tasks).

After some debugging, the actual result is only about 60 bytes per task; the rest is 
accumulator updates (23 accumulators), but this query should not update that many 
accumulators. It seems that we are sending all the accumulators back to the driver 
(not just the updated ones).

Another thing: each accumulator serializes to about 100 bytes, so we could also 
reduce that size.
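
A quick back-of-the-envelope check of those numbers (assuming the 60-byte result figure is per task), which is consistent with roughly 100 bytes per accumulator update plus some per-task framing overhead:
{code}
# Rough check of the numbers above
total_bytes = 1024.7 * 1024              # total serialized result size reported
per_task = total_bytes / 370             # ~2836 bytes per task result
per_accumulator = (per_task - 60) / 23   # ~120 bytes per accumulator update
print(per_task, per_accumulator)
{code}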

cc [~cloud_fan]


> Spark driver requires large memory space for serialized results even when 
> there is no data collected to the driver
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12837
>                 URL: https://issues.apache.org/jira/browse/SPARK-12837
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Tien-Dung LE
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when no data is collected back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start the Spark shell with a spark.driver.maxResultSize setting:
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the following code:
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
> {code}
> The error message is:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> {code}


