GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/1799

    SPARK-2787: Make sort-based shuffle write files directly when there's no 
sorting/aggregation and # partitions is small

    As described in https://issues.apache.org/jira/browse/SPARK-2787, sort-based 
shuffle is currently more expensive than hash-based shuffle for map operations 
that do no partial aggregation or sorting, such as groupByKey. This is because 
it has to serialize each data item twice: once when spilling to intermediate 
files, and again when merging these files object-by-object. This patch adds a 
code path that, when the number of output partitions is small, writes a separate 
file per partition directly and concatenates them at the end to produce a 
sorted file.
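
    For illustration only, the bypass idea can be sketched roughly as follows. 
This is not the ExternalSorter code from the patch: it is a minimal Python 
sketch in which the function names (write_partitioned_bypass, read_partition), 
the use of pickle as a stand-in serializer, and the returned list of segment 
lengths are all assumptions made for the example. It shows each record being 
serialized once into its partition's file, followed by a byte-level 
concatenation in partition order.

```python
import io
import os
import pickle
import tempfile


def write_partitioned_bypass(records, num_partitions, partitioner, out_path):
    """Write (key, value) records to one file, grouped by partition ID.

    Hypothetical helper: each record is serialized exactly once, directly
    into its partition's temporary file. The temporary files are then
    concatenated byte-for-byte, so nothing is deserialized or re-serialized.
    Returns the byte length of each partition's segment in the output file.
    """
    tmp_dir = tempfile.mkdtemp()
    paths = [os.path.join(tmp_dir, f"part-{p}") for p in range(num_partitions)]
    files = [open(path, "wb") for path in paths]
    try:
        for key, value in records:
            # pickle stands in for whatever serializer the engine uses.
            pickle.dump((key, value), files[partitioner(key)])
    finally:
        for f in files:
            f.close()

    lengths = []
    with open(out_path, "wb") as out:
        for path in paths:  # partition order gives an output ordered by partition
            with open(path, "rb") as part:
                data = part.read()
            out.write(data)
            lengths.append(len(data))
            os.remove(path)
    os.rmdir(tmp_dir)
    return lengths


def read_partition(out_path, lengths, partition_id):
    """Read back one partition's records using the recorded segment lengths."""
    offset = sum(lengths[:partition_id])
    with open(out_path, "rb") as f:
        f.seek(offset)
        segment = io.BytesIO(f.read(lengths[partition_id]))
    records = []
    while segment.tell() < lengths[partition_id]:
        records.append(pickle.load(segment))
    return records
```

    The sketch keeps one open file per output partition for the duration of 
the write, which is one reason an approach like this only makes sense when 
the number of partitions is small.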
    
    For unit tests, I added some tests that force this bypass path to be used 
and some that force it not to be, and checked that our tests for other 
features (e.g. all the operations) cover both cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark SPARK-2787

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1799.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1799
    
----
commit a42a102f0f05b01f129947c3ead2cd0674f7ea2e
Author: Matei Zaharia <[email protected]>
Date:   2014-08-05T02:25:49Z

    Move existing logic for writing partitioned files into ExternalSorter
    
    Also renamed ExternalSorter.write(Iterator) to insertAll, to match
    ExternalAppendOnlyMap

commit 82b187a56c1e115f5e4c7d5beed8d3deb6819a77
Author: Matei Zaharia <[email protected]>
Date:   2014-08-06T03:10:22Z

    Add code path to bypass merge-sort in ExternalSorter, and tests

commit f401c78638a85a87a06d4bf6d880bf9f7b9c1f4a
Author: Matei Zaharia <[email protected]>
Date:   2014-08-06T03:12:06Z

    Fix some comments

commit 2afb4122021b3c7655a1e39ab9e11499f5cb3e18
Author: Matei Zaharia <[email protected]>
Date:   2014-08-06T03:27:17Z

    Add docs for shuffle manager properties, and allow short names for them

----


