GitHub user mateiz opened a pull request:
https://github.com/apache/spark/pull/1799
SPARK-2787: Make sort-based shuffle write files directly when there's no
sorting/aggregation and # partitions is small
As described in https://issues.apache.org/jira/browse/SPARK-2787, right now
sort-based shuffle is more expensive than hash-based for map operations that do
no partial aggregation or sorting, such as groupByKey. This is because it has
to serialize each data item twice (once when spilling to intermediate files,
and then again when merging these files object-by-object). This patch adds a
code path to just write separate files directly if the # of output partitions
is small, and concatenate them at the end to produce a sorted file.
On the unit test side, I added some tests that force or don't force this
bypass path to be used, and checked that our tests for other features (e.g. all
the operations) cover both cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mateiz/spark SPARK-2787
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1799.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1799
----
commit a42a102f0f05b01f129947c3ead2cd0674f7ea2e
Author: Matei Zaharia <[email protected]>
Date: 2014-08-05T02:25:49Z
Move existing logic for writing partitioned files into ExternalSorter
Also renamed ExternalSorter.write(Iterator) to insertAll, to match
ExternalAppendOnlyMap
commit 82b187a56c1e115f5e4c7d5beed8d3deb6819a77
Author: Matei Zaharia <[email protected]>
Date: 2014-08-06T03:10:22Z
Add code path to bypass merge-sort in ExternalSorter, and tests
commit f401c78638a85a87a06d4bf6d880bf9f7b9c1f4a
Author: Matei Zaharia <[email protected]>
Date: 2014-08-06T03:12:06Z
Fix some comments
commit 2afb4122021b3c7655a1e39ab9e11499f5cb3e18
Author: Matei Zaharia <[email protected]>
Date: 2014-08-06T03:27:17Z
Add docs for shuffle manager properties, and allow short names for them
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]