Matei Zaharia created SPARK-2787:
------------------------------------
Summary: Make sort-based shuffle write files directly when there
is no sorting / aggregation and # of partitions is small
Key: SPARK-2787
URL: https://issues.apache.org/jira/browse/SPARK-2787
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia

Right now sort-based shuffle is slower than hash-based shuffle for operations like
groupByKey, where data passes straight from the map task to the reduce task with no
sorting or aggregation. The sort-based path keeps building up a buffer in memory and
spilling it to disk (creating GC pressure) instead of directly opening more files, and
then it has to read back and merge all the spills (incurring both serialization cost
and more GC pressure). When the number of partitions is small enough (say, fewer than
100 or 200), we should just open N files, write records to them as in hash-based
shuffle, and then concatenate them at the end so we still end up with a single file
per map task (avoiding the many-files problem of the hash-based implementation).
It may also be possible to avoid the concatenation entirely, but that introduces
complexity in serving the files, so I'd try concatenating for now and benchmark to
see how it performs.
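
Concretely, the proposed write path could look something like the sketch below. This
is a minimal standalone illustration, not Spark's actual code: the object and method
names are made up, and it uses plain java.io writers in place of Spark's internal
disk writers. Each record goes straight to its partition's temp file, and the temp
files are then concatenated into one output file while recording per-partition
lengths for the index.

{code:scala}
import java.io.{DataOutputStream, File, FileInputStream, FileOutputStream}

// Hypothetical sketch of the proposed bypass path: with few partitions,
// write each partition to its own temp file (as hash-based shuffle does),
// then concatenate the temp files into the single map output file that
// sort-based shuffle is expected to produce.
object BypassMergeSortSketch {
  def writeShuffleFile(
      records: Iterator[(Int, String)],  // (partitionId, record) pairs
      numPartitions: Int,
      output: File): Array[Long] = {
    // One temp file and writer per reduce partition, opened up front.
    val partFiles =
      Array.tabulate(numPartitions)(i => File.createTempFile(s"shuffle_$i", ".tmp"))
    val writers = partFiles.map(f => new DataOutputStream(new FileOutputStream(f)))

    // Write each record straight to its partition's file -- no in-memory
    // buffer to grow and spill, hence less GC pressure.
    for ((partition, record) <- records) {
      writers(partition).writeUTF(record)
    }
    writers.foreach(_.close())

    // Concatenate the per-partition files into one output file, recording
    // each partition's length so the index (offsets) can be built from it.
    val lengths = new Array[Long](numPartitions)
    val out = new FileOutputStream(output)
    try {
      val outChannel = out.getChannel
      for (i <- 0 until numPartitions) {
        val in = new FileInputStream(partFiles(i)).getChannel
        try {
          lengths(i) = in.size()
          // transferTo can do the copy in the kernel on many platforms;
          // loop because it may transfer fewer bytes than requested.
          var pos = 0L
          while (pos < lengths(i)) {
            pos += in.transferTo(pos, lengths(i) - pos, outChannel)
          }
        } finally in.close()
        partFiles(i).delete()
      }
    } finally out.close()
    lengths  // per-partition lengths, from which index offsets follow
  }
}
{code}

Because FileChannel.transferTo lets the concatenation happen largely in the kernel,
the extra copy at the end should be cheap compared to re-reading, deserializing and
merging spill files.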