[GitHub] spark pull request #19828: [SPARK-22614] Dataset API: repartitionByRange(......

adrian-ionescu Mon, 27 Nov 2017 07:52:34 -0800

GitHub user adrian-ionescu opened a pull request:

    https://github.com/apache/spark/pull/19828


    [SPARK-22614] Dataset API: repartitionByRange(...)

    ## What changes were proposed in this pull request?
    
    This PR introduces a way to explicitly range-partition a Dataset. So far, 
only round-robin and hash partitioning were possible via `df.repartition(...)`, 
but sometimes range partitioning might be desirable: e.g. when writing to disk, 
better compression without the cost of global sort.
    
    The current implementation piggybacks on the existing 
`RepartitionByExpression` `LogicalPlan` and simply adds the following logic: If 
its expressions are of type `SortOrder`, then it will do `RangePartitioning`; 
otherwise `HashPartitioning`. This is by far the least intrusive approach.
    
    I also considered:
    - adding a new `RepartitionByRange` node, but that resulted in a lot of 
code duplication, touching 10+ files
    - `RepartitionByExpression(child: LogicalPlan, partitioning: 
Partitioning)`, but that:
      - also involved touching a lot of the existing code
      - involved pulling a `catalyst.plans.physical` thing into 
`catalyst.plans.logical`
      - would have required special code for performing `Analysis` on the 
expressions under `Partitioning`, as that wouldn't happen by default anymore, 
as `Partitioning` isn't an `Expression`
    - `RepartitionByExpression(child: LogicalPlan, partitioning: 
LogicalPartitioning)`, with `trait LogicalPartitioning extends Expression with 
Unevaluable` and corresponding subclasses for Range/Hash partitioning, but
      - required a lot of new code
      - basically turned `RepartitionByExpression` into a useless wrapper over 
`LogicalPartitioning`
    
    ## How was this patch tested?
    Simple end-to-end test in `SQLQuerySuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-ionescu/apache-spark repartitionByRange

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19828.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19828
    
----
commit 08527b0c5da6e7ea3dc80392b92ea88f212e5761
Author: Adrian Ionescu <[email protected]>
Date:   2017-11-27T11:54:38Z

    repartitionByRange() + repartition() with SortOrders

commit 950b3dc6429dcade98a9483d90c4f9773a6fc48e
Author: Adrian Ionescu <[email protected]>
Date:   2017-11-27T15:17:36Z

    avoid changing semantics for .repartition()

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19828: [SPARK-22614] Dataset API: repartitionByRange(......

Reply via email to