[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603249#comment-15603249 ]

Apache Spark commented on SPARK-16827:
--------------------------------------

User 'dreamworks007' has created a pull request for this issue:
https://github.com/apache/spark/pull/15616

> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Brian Cho
>              Labels: performance
>             Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
>   ON a.ds = '2016-07-15'
>  AND b.ds = '2016-07-15'
>  AND a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a 
> little into it, we found that one of the stages produces an excessive 
> amount of shuffle data. Please note that this is a regression from Spark 
> 1.6: Stage 2 of the job, which used to produce 32 KB of shuffle data with 
> 1.6, now produces more than 400 GB with Spark 2.0. We also tried turning 
> off whole-stage code generation, but that did not help.
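> (For illustration only, assuming the Spark 2.0 SparkSession conf API in 
> the spark-shell; this sketch is not from the original report:)
> {code}
> // Disable whole-stage code generation for the current session.
> // `spark` is the SparkSession provided by the spark-shell.
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> {code}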
> PS - Even though the intermediate shuffle data size is huge, the job 
> still produces correct output.
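> A minimal sketch (hypothetical names, not Spark's actual TaskMetrics API) 
> of the separation the summary asks for: spill bytes kept in a counter 
> distinct from shuffle write bytes, so that spilling during a sort cannot 
> inflate the reported shuffle output size.
> {code}
> // Hypothetical, simplified metrics holder (illustration only).
> class TaskMetricsSketch {
>   private var shuffleBytesWritten: Long = 0L // bytes written as shuffle output
>   private var diskBytesSpilled: Long = 0L    // bytes spilled under memory pressure
>
>   def incShuffleBytesWritten(n: Long): Unit = shuffleBytesWritten += n
>
>   // A spill path should call this, never incShuffleBytesWritten.
>   def incDiskBytesSpilled(n: Long): Unit = diskBytesSpilled += n
>
>   def report(): String =
>     s"shuffle=$shuffleBytesWritten bytes, spilled=$diskBytesSpilled bytes"
> }
> {code}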


