[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111631#comment-14111631
 ] 

Matei Zaharia commented on SPARK-2926:
--------------------------------------

I see, thanks for posting the benchmarks. This does seem like it's worth 
investigating further. Can you also run some tests with other aggregation 
factors? Also, a few notes on the configuration:
* There's a big change in behavior when you go above 200 reduce tasks because 
ExternalSorter does the same thing as hash-based shuffle if the # of reduce 
tasks is below 200.
* When doing these kind of tests, set spark.kryo.referenceTracking = false, 
otherwise serialization will be a major CPU cost.
* We need to try other types of keys as well, e.g. integers or longer strings. 
As I said the cost to compare elements will depend on their datatype.

Finally, it would be great if this proposal reused parts of ExternalSorter or 
otherwise shared code for it. It looks like it's not a clear win in all cases, 
but maybe for example we can use it in sortByKey, and later in groupBy / join 
when we update those to deal with it.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improve the IO performance and reduce the memory consumption when 
> reducer number is very large. But for the reducer side, it still adopts the 
> implementation of hash-based shuffle reader, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose a MR style sort-merge like shuffle reader for sort-based 
> shuffle to better improve the performance of sort-based shuffle.
> Working in progress code and performance test report will be posted later 
> when some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to