[ https://issues.apache.org/jira/browse/SPARK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087197#comment-14087197 ]

Apache Spark commented on SPARK-2787:
-------------------------------------

User 'mateiz' has created a pull request for this issue:
https://github.com/apache/spark/pull/1799

> Make sort-based shuffle write files directly when there is no sorting / 
> aggregation and # of partitions is small
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2787
>                 URL: https://issues.apache.org/jira/browse/SPARK-2787
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>
> Right now sort-based shuffle is slower than hash-based shuffle for 
> operations like groupByKey, where data is passed straight from the map task 
> to the reduce task. Instead of directly opening more files, it keeps 
> building up a buffer in memory and spilling it to disk (creating more GC 
> pressure), and it then has to read back and merge all the spills (incurring 
> both serialization cost and further GC pressure). When the number of 
> partitions is small enough (say, less than 100 or 200), we should just open 
> N files, write to them as in hash-based shuffle, and concatenate them at 
> the end so we still end up with a single file (avoiding the many-files 
> problem of the hash-based implementation).
> It may also be possible to avoid the concatenation, but that introduces 
> complexity in serving the files, so I'd try concatenating them for now. We 
> can benchmark it to see the performance.
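The proposed bypass path described above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not Spark's actual implementation: the `writePartitioned` helper, the text record format, and the file names are all invented for this sketch. The key idea it shows is writing each record directly to one of N per-partition files (as in hash-based shuffle) and then concatenating them into a single output, recording byte offsets so each reducer can fetch just its segment.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of the bypass idea: hash each record's key into one of
// N partition files, then concatenate the files into a single map output.
public class BypassShuffleSketch {
    // Returns offsets[p] = byte offset where partition p's segment starts in
    // the merged file; offsets[numPartitions] is the total file length.
    public static long[] writePartitioned(List<Map.Entry<Integer, String>> records,
                                          int numPartitions, Path out) throws IOException {
        Path dir = Files.createTempDirectory("shuffle");
        // One open file per reduce partition, as in hash-based shuffle.
        BufferedWriter[] writers = new BufferedWriter[numPartitions];
        Path[] parts = new Path[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            parts[p] = dir.resolve("part-" + p);
            writers[p] = Files.newBufferedWriter(parts[p]);
        }
        // No in-memory buffering or sorting: records go straight to their file.
        for (Map.Entry<Integer, String> rec : records) {
            int p = Math.floorMod(rec.getKey().hashCode(), numPartitions);
            writers[p].write(rec.getKey() + "\t" + rec.getValue() + "\n");
        }
        for (BufferedWriter w : writers) w.close();
        // Concatenate into a single file, recording each partition's offset.
        long[] offsets = new long[numPartitions + 1];
        try (OutputStream merged = Files.newOutputStream(out)) {
            long pos = 0;
            for (int p = 0; p < numPartitions; p++) {
                offsets[p] = pos;
                byte[] bytes = Files.readAllBytes(parts[p]);
                merged.write(bytes);
                pos += bytes.length;
            }
            offsets[numPartitions] = pos;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Keys 0 and 4 both hash to partition 0; partition 3 stays empty.
        List<Map.Entry<Integer, String>> recs = List.of(
            Map.entry(0, "a"), Map.entry(1, "b"),
            Map.entry(2, "c"), Map.entry(4, "d"));
        Path out = Files.createTempFile("merged", ".data");
        long[] offsets = writePartitioned(recs, 4, out);
        System.out.println(Arrays.toString(offsets)); // [0, 8, 12, 16, 16]
        System.out.println(Files.size(out) == offsets[4]); // true
    }
}
```

The offset index is what makes the concatenation workable: reducers can still request a single (offset, length) range from the one merged file, so the serving side does not get more complex than with one file per map task.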



--
This message was sent by Atlassian JIRA
(v6.2#6252)
