GitHub user sujithjay opened a pull request:
https://github.com/apache/spark/pull/20002
[SPARK-22465][Core][WIP] Add a safety-check to RDD defaultPartitioner
## What changes were proposed in this pull request?
When choosing a Partitioner for a cogroup-like operation between a number
of RDDs, the default behaviour was as follows: if some of the RDDs already
have a partitioner, we choose the one amongst them with the maximum number
of partitions.
This behaviour, in some cases, could hit the 2G limit (SPARK-6235). To
illustrate one such scenario, consider two RDDs:
rDD1: with less data and a smaller number of partitions, along with a
Partitioner.
rDD2: with much more data and a larger number of partitions, without a
Partitioner.
The cogroup of these two RDDs could hit the 2G limit, as a larger amount of
data is shuffled into a smaller number of partitions.
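As a rough illustration of this scenario, the sketch below works through the
arithmetic with made-up numbers. The total size (500 GiB), the partition
counts (10 and 1000) and the helper `avg_partition_bytes` are assumptions
chosen for illustration only; they do not come from the JIRA, and the
calculation assumes keys are spread evenly across partitions.

```python
# Hypothetical sizes, chosen only to illustrate the arithmetic.
GIB = 1024 ** 3
BLOCK_LIMIT_BYTES = 2 * GIB  # the 2G limit referred to in SPARK-6235


def avg_partition_bytes(total_bytes: int, num_partitions: int) -> float:
    """Average bytes per partition, assuming an even key distribution."""
    return total_bytes / num_partitions


rdd2_total = 500 * GIB  # assumed size of the larger RDD (rDD2)

# Reusing rDD1's Partitioner (assumed to have 10 partitions) for the cogroup:
with_small_partitioner = avg_partition_bytes(rdd2_total, 10)

# Partitioning by rDD2's own partition count (assumed to be 1000) instead:
with_large_partition_count = avg_partition_bytes(rdd2_total, 1000)

print(with_small_partitioner > BLOCK_LIMIT_BYTES)      # True: about 50 GiB each
print(with_large_partition_count > BLOCK_LIMIT_BYTES)  # False: about 0.5 GiB each
```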
This PR introduces a safety-check wherein the existing Partitioner is chosen
only if either of the following conditions is met:
1. if the number of partitions of the RDD associated with the Partitioner
is greater than or equal to the max number of upstream partitions; or
2. if the number of partitions of the RDD associated with the Partitioner
is less than and within a single order of magnitude of the max number of
upstream partitions.
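The two conditions can be sketched as a small predicate. This is not the code
in the patch: the function name `is_eligible_partitioner`, its parameters, and
the reading of "within a single order of magnitude" as "less than a tenfold
difference" are assumptions made here for illustration.

```python
def is_eligible_partitioner(partitioner_partitions: int,
                            max_upstream_partitions: int) -> bool:
    """Return True if an existing Partitioner may be reused.

    partitioner_partitions: partitions of the RDD that carries the Partitioner.
    max_upstream_partitions: largest partition count among the upstream RDDs.
    """
    if partitioner_partitions <= 0 or max_upstream_partitions <= 0:
        raise ValueError("partition counts must be positive")

    # Condition 1: the Partitioner already has at least as many partitions.
    if partitioner_partitions >= max_upstream_partitions:
        return True

    # Condition 2 (assumed interpretation): the Partitioner has fewer
    # partitions, but the gap is less than a factor of ten.
    return max_upstream_partitions < 10 * partitioner_partitions


print(is_eligible_partitioner(1000, 200))  # True: condition 1
print(is_eligible_partitioner(200, 1000))  # True: condition 2, 5x gap
print(is_eligible_partitioner(10, 1000))   # False: 100x gap
```

When the predicate is false, the existing Partitioner is ignored, as the
commit message in this PR describes.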
## How was this patch tested?
Unit tests in PartitioningSuite and PairRDDFunctionsSuite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sujithjay/spark SPARK-22465
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20002.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20002
----
commit 176270b3dbddb1f8d1330709dfea2022eebb7a11
Author: sujithjay <[email protected]>
Date: 2017-12-16T12:16:13Z
[SPARK-22465][Core][WIP] Add a safety-check to RDD defaultPartitioner
that ignores existing Partitioners, if they are more than a single order
of magnitude smaller than the max number of upstream partitions
commit be391a78db920f944ce2fe1223dd604aae56871a
Author: sujithjay <[email protected]>
Date: 2017-12-16T12:22:41Z
Merge remote-tracking branch 'origin-apache/master' into SPARK-22465
----