GitHub user sujithjay opened a pull request:

    https://github.com/apache/spark/pull/20002

    [SPARK-22465][Core][WIP] Add a safety-check to RDD defaultPartitioner

    ## What changes were proposed in this pull request?
    In choosing a Partitioner to use for a cogroup-like operation between a 
number of RDDs, the default behaviour was if some of the RDDs already have a 
partitioner, we choose the one amongst them with the maximum number of 
partitions.
    
    This behaviour, in some cases, could hit the 2G limit (SPARK-6235). To 
illustrate one such scenario, consider two RDDs:
    rDD1: with smaller data and smaller number of partitions, alongwith a 
Partitioner.
    rDD2: with much larger data and a larger number of partitions, without a 
Partitioner.
    
    The cogroup of these two RDDs could hit the 2G limit, as a larger amount of 
data is shuffled into a smaller number of partitions.
    
    This PR introduces a safety-check wherein the Partitioner is chosen only if 
either of the following conditions are met:
    1. if the number of partitions of the RDD associated with the Partitioner 
is greater than or equal to the max number of upstream partitions; or 
    2. if the number of partitions of the RDD associated with the Partitioner 
is less than and within a single order of magnitude of the max number of 
upstream partitions.
    
    ## How was this patch tested?
    Unit tests in PartitioningSuite and PairRDDFunctionsSuite


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujithjay/spark SPARK-22465

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20002.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20002
    
----
commit 176270b3dbddb1f8d1330709dfea2022eebb7a11
Author: sujithjay <[email protected]>
Date:   2017-12-16T12:16:13Z

    [SPARK-22465][Core][WIP] Add a safety-check to RDD defaultPartitioner
    
     that ignores existing Partitioners, if they are more than a single order 
of magnitude smaller than the max number of upstream partitions

commit be391a78db920f944ce2fe1223dd604aae56871a
Author: sujithjay <[email protected]>
Date:   2017-12-16T12:22:41Z

    Merge remote-tracking branch 'origin-apache/master' into SPARK-22465

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to