GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/21782
[SPARK-24816][SQL] SQL interface support repartitionByRange
## What changes were proposed in this pull request?
Add SQL interface support for `repartitionByRange` to improve data pushdown.
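The Dataset API already exposes `repartitionByRange`; this change adds an equivalent clause to the SQL dialect. A minimal sketch of the intended usage, assuming the clause ordering implied by the benchmark mode labels below (table names are illustrative; the exact grammar is defined by this PR's parser change):
```sql
-- Hypothetical usage of the proposed clause, mirroring the
-- "RANGE PARTITION BY + SORT BY" benchmark mode below.
-- t_range and source_table are illustrative names.
INSERT OVERWRITE TABLE t_range
SELECT * FROM source_table
RANGE PARTITION BY id
SORT BY id
```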
## How was this patch tested?
Manual tests and benchmark tests.
I tested this feature with a large table (data size: 1.1 TB, row count: 282,001,954,428).
The test SQL:
```sql
select * from table where id=401564838907
```
The test result:
| Mode | Input Size | Records | Total Time | Duration | Prepare Data Resource Allocation (MB-seconds) |
| -- | -- | -- | -- | -- | -- |
| default | 959.2 GB | 237,624,395,522 | 11.2 h | 1.3 min | 6,496,280,086 |
| DISTRIBUTE BY | 970.8 GB | 244,642,791,213 | 11.4 h | 1.3 min | 10,536,069,846 |
| SORT BY | 456.3 GB | 101,587,838,784 | 5.4 h | 31 s | 8,965,158,620 |
| DISTRIBUTE BY + SORT BY | 219.0 GB | 51,723,521,593 | 3.3 h | 54 s | 12,552,656,774 |
| RANGE PARTITION BY | 38.5 GB | 75,355,144 | 45 min | 13 s | 14,525,275,297 |
| RANGE PARTITION BY + SORT BY | 17.4 GB | 14,334,724 | 45 min | 12 s | 16,255,296,698 |
It's obvious that after the data is repartitioned by `RANGE PARTITION BY id
SORT BY id`, only 17.4 GB of data needs to be read to filter `id=401564838907`, and
performance is greatly improved: range partitioning clusters nearby `id` values into the
same files, so a point filter only touches the few files whose min/max statistics cover
the target value.
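For comparison, a sketch of the strongest pre-existing alternative from the table above, which uses hash distribution (again with illustrative table names):
```sql
-- Baseline preparation with existing clauses (the "DISTRIBUTE BY + SORT BY"
-- mode above). DISTRIBUTE BY hash-partitions rows by id, so adjacent ids are
-- scattered across files and min/max-based file skipping is far less
-- effective than with range partitioning.
INSERT OVERWRITE TABLE t_baseline
SELECT * FROM source_table
DISTRIBUTE BY id
SORT BY id
```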
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-24816
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21782.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21782
----
commit 3e3bbada6e9b02f4fd5d8db216bdc2ce4a397d12
Author: Yuming Wang <yumwang@...>
Date: 2018-07-16T09:06:20Z
SQL interface support RANGE PARTITION
----