[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Li Yuanjian updated SPARK-2926:
-------------------------------
    Attachment: Spark Shuffle Test Report on Spark2.x.pdf

[~jerryshao] Hi Saisai, thanks for your advice. I added a test report following your suggestion.

As described in the report, I compare the two shuffle modes only in the 'sort-by-key' workload, because the other test workloads share the same code path in the POC implementation (SortShuffleWriter with BlockStoreShuffleReader). I also added a config ([code link|https://github.com/apache/spark/pull/19745/commits/fe9394eadf8ea51af2b2cb41b5b42981fa600752]) whose only purpose is to force SerializedShuffle off in the 'sort-by-key' workload; otherwise both master and the POC would use SerializedShuffle.

For the sort-by-key workload with SerializedShuffle disabled, the POC is 1.44x faster than current master: the map-side stage is 1.16x slower, but the reducer stage is 9.4x faster.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on
> Spark 2.x.pdf, Spark Shuffle Test Report on Spark2.x.pdf, Spark Shuffle Test
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly
> improves IO performance and reduces memory consumption when the number
> of reducers is very large. On the reducer side, however, it still uses the
> hash-based shuffle reader implementation, which in some situations ignores
> the ordering of the map output data.
> Here we propose an MR-style, sort-merge shuffle reader for sort-based
> shuffle to improve the performance of sort-based shuffle.
> Work-in-progress code and a performance test report will be posted later,
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.
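
The merge-sort idea in the issue description can be illustrated with a small k-way merge: if each map output is already sorted by key, the reducer can merge the sorted runs instead of re-sorting everything it fetched. The sketch below is only an illustration of that general technique, written in Python for brevity. It is not the POC code or Spark's implementation; the function name `merge_sorted_runs`, the `(key, value)` record shape, and the sample data are all assumptions made for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape for illustration: (key, value).
Record = Tuple[int, str]


def merge_sorted_runs(runs: List[Iterable[Record]]) -> Iterator[Record]:
    """K-way merge of runs that are each already sorted by key.

    Keeps one pending record per run in a min-heap, so memory use is
    proportional to the number of runs, not the number of records.
    """
    iterators = [iter(run) for run in runs]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            # The run index breaks ties, so values are never compared.
            heapq.heappush(heap, (first[0], idx, first))

    while heap:
        _, idx, record = heapq.heappop(heap)
        yield record
        following = next(iterators[idx], None)
        if following is not None:
            heapq.heappush(heap, (following[0], idx, following))


# Made-up stand-ins for three map outputs, each sorted by key.
map_outputs = [
    [(1, "a"), (4, "d"), (7, "g")],
    [(2, "b"), (5, "e")],
    [(3, "c"), (6, "f"), (8, "h")],
]
merged = list(merge_sorted_runs(map_outputs))
```

With k sorted runs and n records in total, the merge costs O(n log k) comparisons, whereas sorting the concatenated input from scratch costs O(n log n). That is the kind of saving a reducer can exploit when the map side has already sorted its output.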