[GitHub] spark pull request #22010: [SPARK-21436] Take advantage of known partitioner f...

holdenk Mon, 06 Aug 2018 11:08:43 -0700

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/22010


    [SPARK-21436] Take advantage of known partitioner for distinct on RDDs to avoid a shuffle

    
    ## What changes were proposed in this pull request?
    
    Special-case the situation where the RDD's partitioner is known and the
    requested number of output partitions matches that partitioner. In that
    case, skip the shuffle and compute distinct within each partition.
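
The idea can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Spark code and not the patch itself: partitions are modelled as lists, and the names `partition_of`, `hash_partition`, `distinct`, and `known_num_partitions` are hypothetical. It assumes a hash-style partitioner, so equal elements always land in the same partition; under that assumption, removing duplicates inside each partition already yields a globally distinct result.

```python
from typing import Hashable, List, Optional, Tuple


def partition_of(key: Hashable, num_partitions: int) -> int:
    """Toy hash partitioner (hypothetical): map a key to a partition index."""
    return hash(key) % num_partitions


def hash_partition(items: List[Hashable], num_partitions: int) -> List[List[Hashable]]:
    """Simulated shuffle: route every element to the partition chosen by its hash."""
    parts: List[List[Hashable]] = [[] for _ in range(num_partitions)]
    for item in items:
        parts[partition_of(item, num_partitions)].append(item)
    return parts


def distinct(
    partitions: List[List[Hashable]],
    known_num_partitions: Optional[int],
    target_num_partitions: int,
) -> Tuple[List[List[Hashable]], bool]:
    """Return (deduplicated partitions, whether a shuffle was performed).

    known_num_partitions is None when the partitioner is unknown.
    """
    if known_num_partitions == target_num_partitions:
        # Fast path: equal elements already share a partition, so
        # per-partition deduplication is enough and placement is unchanged.
        return [list(dict.fromkeys(p)) for p in partitions], False
    # General path: redistribute everything, then deduplicate.
    all_items = [x for p in partitions for x in p]
    shuffled = hash_partition(all_items, target_num_partitions)
    return [list(dict.fromkeys(p)) for p in shuffled], True


data = [1, 2, 2, 3, 3, 3, 4, 5, 5]
parts = hash_partition(data, 3)

same, shuffled_same = distinct(parts, 3, 3)      # partitioner known, same target
other, shuffled_other = distinct(parts, 3, 2)    # different target: must shuffle
print(same, shuffled_same)
print(other, shuffled_other)
```

In the first call no element changes partition, which is the property the fast path relies on. The second call falls back to a full redistribution because the target partition count differs.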
    
    ## How was this patch tested?
    
    New unit test that verifies the partitioner does not change when the
    partitioner is known and distinct is called with the same target number
    of partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-21436-take-advantage-of-known-partioner-for-distinct-on-rdds

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22010.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22010
    
----
commit a7fbc74335c2df27002e8158f8e83a919195eed7
Author: Holden Karau <holden@...>
Date:   2018-08-06T18:04:31Z

    [SPARK-21436] Take advantage of known partitioner for distinct on RDDs to avoid a shuffle.
    Special-case the situation where the partitioner is known and the
    requested number of output partitions matches that partitioner,
    to avoid a shuffle and instead compute distinct within each partition.

----

