[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570446#comment-15570446 ]
Gaoxiang Liu edited comment on SPARK-16827 at 10/13/16 1:14 AM:
----------------------------------------------------------------

[~rxin], for this one, if I want to add spill time metrics, do you suggest I create a parent class DiskWriteMetrics, have ShuffleWriteMetrics and my new class (e.g. SpillWriteMetrics) inherit from it, and then pass the parent class (DiskWriteMetrics) to UnsafeSorterSpillWriter (https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209)? Or do you suggest renaming the ShuffleWriteMetrics class to something like WriteMetrics?

> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Brian Cho
>              Labels: performance
>             Fix For: 2.1.0
>
>
> One of our Hive jobs, which looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
> ON a.ds = '2016-07-15'
> AND b.ds = '2016-07-15'
> AND a.source_id = b.id
> {code}
> is significantly slower after upgrading to Spark 2.0. Digging a little
> into it, we found that one of the stages produces an excessive amount of
> shuffle data. Please note that this is a regression from Spark 1.6: stage 2
> of the job, which used to produce 32 KB of shuffle data with 1.6, now produces
> more than 400 GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces accurate output.
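For reference, a minimal Scala sketch of the hierarchy proposed in the comment above. DiskWriteMetrics and SpillWriteMetrics are hypothetical names from the comment, not classes that exist in Spark, and the counters shown are simplified plain fields assumed to mirror ShuffleWriteMetrics' bytesWritten / recordsWritten / writeTime accessors (the real class is accumulator-backed):

{code}
// Hypothetical common parent for all disk-write metrics, as suggested
// in the comment; not part of the actual Spark codebase.
abstract class DiskWriteMetrics {
  private var _bytesWritten: Long = 0L
  private var _recordsWritten: Long = 0L
  private var _writeTime: Long = 0L

  def bytesWritten: Long = _bytesWritten
  def recordsWritten: Long = _recordsWritten
  def writeTime: Long = _writeTime

  def incBytesWritten(v: Long): Unit = _bytesWritten += v
  def incRecordsWritten(v: Long): Unit = _recordsWritten += v
  def incWriteTime(v: Long): Unit = _writeTime += v
}

// The existing shuffle metrics would extend the common parent...
class ShuffleWriteMetrics extends DiskWriteMetrics

// ...as would the new spill metrics, so spill writes stop being
// counted against shuffle metrics.
class SpillWriteMetrics extends DiskWriteMetrics

// UnsafeExternalSorter could then hand UnsafeSorterSpillWriter a
// SpillWriteMetrics instance through a parameter typed as the parent:
//   def spill(metrics: DiskWriteMetrics): Unit = ...
{code}

The alternative in the comment, renaming ShuffleWriteMetrics to something like WriteMetrics, avoids the extra class but loses the type-level distinction between shuffle writes and spill writes.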