[
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603249#comment-15603249
]
Apache Spark commented on SPARK-16827:
--------------------------------------
User 'dreamworks007' has created a pull request for this issue:
https://github.com/apache/spark/pull/15616
> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 2.0.0
> Reporter: Sital Kedia
> Assignee: Brian Cho
> Labels: performance
> Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs, which looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
> ON a.ds = '2016-07-15'
> AND b.ds = '2016-07-15'
> AND a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a
> little into it, we found that one of the stages reports an excessive amount
> of shuffle data. Please note that this is a regression from Spark 1.6: stage 2
> of the job, which used to report 32KB of shuffle data with 1.6, now reports
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the reported intermediate shuffle data size is huge, the job
> still produces accurate output.
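The symptom (a huge reported shuffle size but correct output) is consistent with spill bytes being accumulated into the shuffle-write metric instead of a separate spill metric. A minimal, hypothetical sketch of the distinction, with toy names that do not match Spark's internal API:

```scala
// Illustrative only: a toy stand-in for per-task metrics, not Spark's classes.
object SpillMetricsSketch {
  final case class TaskMetrics(shuffleBytesWritten: Long, diskBytesSpilled: Long)

  // Buggy accounting: spill writes are folded into the shuffle metric,
  // so the reported shuffle size grows with every sort spill.
  def recordSpillBuggy(m: TaskMetrics, spilled: Long): TaskMetrics =
    m.copy(shuffleBytesWritten = m.shuffleBytesWritten + spilled)

  // Fixed accounting: spills go to their own metric; the shuffle
  // metric reflects only data actually written for the shuffle.
  def recordSpillFixed(m: TaskMetrics, spilled: Long): TaskMetrics =
    m.copy(diskBytesSpilled = m.diskBytesSpilled + spilled)

  def main(args: Array[String]): Unit = {
    val start = TaskMetrics(shuffleBytesWritten = 32 * 1024L, diskBytesSpilled = 0L)
    val buggy = recordSpillBuggy(start, spilled = 1024L * 1024 * 1024)
    val fixed = recordSpillFixed(start, spilled = 1024L * 1024 * 1024)
    println(s"buggy shuffle bytes reported: ${buggy.shuffleBytesWritten}")
    println(s"fixed shuffle bytes reported: ${fixed.shuffleBytesWritten}")
  }
}
```

Under this reading, the job's real shuffle output was always small; only the metric (and anything keyed off it) was inflated, which matches the correct final output.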
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)