[
https://issues.apache.org/jira/browse/SPARK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087197#comment-14087197
]
Apache Spark commented on SPARK-2787:
-------------------------------------
User 'mateiz' has created a pull request for this issue:
https://github.com/apache/spark/pull/1799
> Make sort-based shuffle write files directly when there is no sorting /
> aggregation and # of partitions is small
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-2787
> URL: https://issues.apache.org/jira/browse/SPARK-2787
> Project: Spark
> Issue Type: Improvement
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
>
> Right now sort-based shuffle is slower than hash-based for operations like
> groupByKey where data is passed straight from the map task to the reduce
> task. Instead of directly opening more files, it keeps building up a buffer
> in memory and spilling it to disk (creating GC pressure), and it then has
> to read back and merge all the spills (incurring both serialization cost
> and further GC pressure). When the number of partitions is small enough
> (say less than 100 or 200), we should just open N files, write records to
> them as in hash-based shuffle, and then concatenate them at the end, so we
> still end up with a single file (avoiding the many-files problem of the
> hash-based implementation).
> It may also be possible to avoid the concatenation, but that introduces
> complexity in serving the files, so I'd try concatenating them for now. We
> can benchmark it to see the performance.
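The proposed write path can be sketched roughly as follows. This is an illustrative, language-agnostic sketch in Python, not Spark's actual implementation; the function name, file layout, and offset index are all assumptions made here for clarity:

```python
import os
import tempfile

def shuffle_write_bypass_merge(records, num_partitions, hash_fn=hash):
    """Sketch of the proposed write path: one temp file per reduce
    partition, then concatenate into a single map output file.
    (Hypothetical names; not Spark's API.)"""
    tmpdir = tempfile.mkdtemp()

    # Step 1: write each record straight to its partition's file, as
    # hash-based shuffle does -- no in-memory buffer, no spill/merge.
    paths = [os.path.join(tmpdir, "part-%d" % p) for p in range(num_partitions)]
    files = [open(p, "wb") for p in paths]
    for key, value in records:
        part = hash_fn(key) % num_partitions
        files[part].write(("%s\t%s\n" % (key, value)).encode())
    for f in files:
        f.close()

    # Step 2: concatenate the per-partition files into one output file,
    # recording byte offsets so each reducer can fetch just its segment.
    out_path = os.path.join(tmpdir, "shuffle-output")
    offsets = [0]
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
            offsets.append(out.tell())
            os.remove(p)
    return out_path, offsets
```

The offset list plays the role of the index that lets a single concatenated file be served per-partition, which is why concatenation still avoids the many-files problem without complicating the serving side much.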
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]