[jira] [Resolved] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

Reynold Xin (JIRA) Wed, 12 Oct 2016 16:02:40 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Reynold Xin resolved SPARK-16827.
---------------------------------
       Resolution: Fixed
         Assignee: Brian Cho
    Fix Version/s: 2.1.0

> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Brian Cho
>              Labels: performance
>             Fix For: 2.1.0
>
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>      FROM  table1 a
>      JOIN table2 b
>       ON    a.ds = '2016-07-15'
>       AND  b.ds = '2016-07-15'
>       AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

Reply via email to