GitHub user adrian-ionescu opened a pull request:
https://github.com/apache/spark/pull/19828
[SPARK-22614] Dataset API: repartitionByRange(...)
## What changes were proposed in this pull request?
This PR introduces a way to explicitly range-partition a Dataset. So far,
only round-robin and hash partitioning were possible via `df.repartition(...)`,
but sometimes range partitioning might be desirable: e.g. when writing to disk,
better compression without the cost of global sort.
The current implementation piggybacks on the existing
`RepartitionByExpression` `LogicalPlan` and simply adds the following logic: If
its expressions are of type `SortOrder`, then it will do `RangePartitioning`;
otherwise `HashPartitioning`. This is by far the least intrusive approach.
I also considered:
- adding a new `RepartitionByRange` node, but that resulted in a lot of
code duplication, touching 10+ files
- `RepartitionByExpression(child: LogicalPlan, partitioning:
Partitioning)`, but that:
- also involved touching a lot of the existing code
- involved pulling a `catalyst.plans.physical` thing into
`catalyst.plans.logical`
- would have required special code for performing `Analysis` on the
expressions under `Partitioning`, as that wouldn't happen by default anymore,
as `Partitioning` isn't an `Expression`
- `RepartitionByExpression(child: LogicalPlan, partitioning:
LogicalPartitioning)`, with `trait LogicalPartitioning extends Expression with
Unevaluable` and corresponding subclasses for Range/Hash partitioning, but
- required a lot of new code
- basically turned `RepartitionByExpression` into a useless wrapper over
`LogicalPartitioning`
## How was this patch tested?
Simple end-to-end test in `SQLQuerySuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/adrian-ionescu/apache-spark repartitionByRange
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19828.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19828
----
commit 08527b0c5da6e7ea3dc80392b92ea88f212e5761
Author: Adrian Ionescu <[email protected]>
Date: 2017-11-27T11:54:38Z
repartitionByRange() + repartition() with SortOrders
commit 950b3dc6429dcade98a9483d90c4f9773a6fc48e
Author: Adrian Ionescu <[email protected]>
Date: 2017-11-27T15:17:36Z
avoid changing semantics for .repartition()
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]