GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/22010
[SPARK-21436] Take advantage of known partitioner for distinct on RDDs to
avoid a shuffle
## What changes were proposed in this pull request?
Special-case the situation where the partitioner is known and the requested
number of output partitions equals the number of partitions of the current
partitioner. In that case, skip the shuffle and compute distinct within each
partition instead.
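
The reasoning can be illustrated without Spark. The sketch below is a plain-Python model, not the code in this patch: `hash_partition` and `distinct_within_partitions` are hypothetical helpers, and Python's built-in `hash` stands in for a hash partitioner. Because a partitioner sends equal elements to the same partition, de-duplicating each partition on its own should give the same result as a global distinct, with no data moving between partitions.

```python
from typing import Hashable, Iterable, List


def hash_partition(items: Iterable[Hashable], num_partitions: int) -> List[list]:
    """Toy hash partitioner: equal items always land in the same partition."""
    partitions: List[list] = [[] for _ in range(num_partitions)]
    for item in items:
        partitions[hash(item) % num_partitions].append(item)
    return partitions


def distinct_within_partitions(partitions: List[list]) -> List[list]:
    """De-duplicate each partition independently, keeping first-seen order.

    No element moves between partitions, which models skipping the shuffle.
    """
    result = []
    for part in partitions:
        seen = set()
        unique = []
        for item in part:
            if item not in seen:
                seen.add(item)
                unique.append(item)
        result.append(unique)
    return result


data = [1, 2, 2, 3, 3, 3, 10, 10, 11]
parts = hash_partition(data, num_partitions=4)
deduped = distinct_within_partitions(parts)

# Flattening the per-partition results matches a global distinct.
flattened = [x for part in deduped for x in part]
assert sorted(flattened) == sorted(set(data))
```

The shortcut only holds when the data is already laid out by a known partitioner; with arbitrary placement, the same value could sit in two partitions and survive twice.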
## How was this patch tested?
New unit test that verifies the partitioner does not change when it is
known and distinct is called with the same target number of partitions.
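
As a rough illustration of the property such a test checks, here is a plain-Python sketch. `ToyRDD` is a hypothetical stand-in and not Spark's RDD API; the string label used as a partitioner, and the choice to drop it on the fallback path, are assumptions of this model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus optional partitioner metadata."""
    partitions: List[list]
    partitioner: Optional[str] = None  # a label is enough for this sketch

    def distinct(self, num_partitions: int) -> "ToyRDD":
        if self.partitioner is not None and num_partitions == len(self.partitions):
            # Fast path: de-duplicate each partition in place and keep the partitioner.
            deduped = [list(dict.fromkeys(p)) for p in self.partitions]
            return ToyRDD(deduped, self.partitioner)
        # Fallback: redistribute by hash (models the shuffle). In this toy
        # model the result carries no partitioner metadata.
        shuffled: List[list] = [[] for _ in range(num_partitions)]
        for item in dict.fromkeys(x for p in self.partitions for x in p):
            shuffled[hash(item) % num_partitions].append(item)
        return ToyRDD(shuffled, None)


rdd = ToyRDD([[0, 2, 2, 4], [1, 1, 3]], partitioner="hash(2)")

same = rdd.distinct(num_partitions=2)
assert same.partitioner == rdd.partitioner      # partitioner unchanged
assert same.partitions == [[0, 2, 4], [1, 3]]   # duplicates removed per partition

other = rdd.distinct(num_partitions=3)
assert other.partitioner is None                # different target: fallback path
```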
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark SPARK-21436-take-advantage-of-known-partioner-for-distinct-on-rdds
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22010.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22010
----
commit a7fbc74335c2df27002e8158f8e83a919195eed7
Author: Holden Karau <holden@...>
Date: 2018-08-06T18:04:31Z
[SPARK-21436] Take advantage of known partitioner for distinct on RDDs to
avoid a shuffle.
Special-case the situation where the partitioner is known and the requested
number of output partitions equals the number of partitions of the current
partitioner, to avoid a shuffle and instead compute distinct within each
partition.
----