GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21782
[SPARK-24816][SQL] SQL interface support repartitionByRange

## What changes were proposed in this pull request?

Add SQL interface support for `repartitionByRange` to improve data pushdown.

## How was this patch tested?

Manual tests and benchmark tests. I have tested this feature with a big table (data size: 1.1 TB, row count: 282,001,954,428). The test SQL:

```sql
select * from table where id=401564838907
```

The test result:

Mode | Input Size | Records | Total Time | Duration | Prepare data Resource Allocation MB-seconds
-- | -- | -- | -- | -- | --
default | 959.2 GB | 237624395522 | 11.2 h | 1.3 min | 6496280086
DISTRIBUTE BY | 970.8 GB | 244642791213 | 11.4 h | 1.3 min | 10536069846
SORT BY | 456.3 GB | 101587838784 | 5.4 h | 31 s | 8965158620
DISTRIBUTE BY + SORT BY | 219.0 GB | 51723521593 | 3.3 h | 54 s | 12552656774
RANGE PARTITION BY | 38.5 GB | 75355144 | 45 min | 13 s | 14525275297
RANGE PARTITION BY + SORT BY | 17.4 GB | 14334724 | 45 min | 12 s | 16255296698

It's obvious that after the data is repartitioned by `RANGE PARTITION BY id SORT BY id`, only 17.4 GB of data needs to be read to filter `id=401564838907`, and the performance is greatly improved.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24816

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21782.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21782

----
commit 3e3bbada6e9b02f4fd5d8db216bdc2ce4a397d12
Author: Yuming Wang <yumwang@...>
Date: 2018-07-16T09:06:20Z

    SQL interface support RANGE PARTITION
----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
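To see why range partitioning helps a point filter like `id=401564838907` so much: each partition ends up holding a contiguous key range, so only the one partition whose range covers the key needs to be scanned. Below is a minimal, self-contained Python sketch of that idea (not Spark's implementation; the `range_partition` helper and its sampling strategy are illustrative assumptions):

```python
import bisect
import random

def range_partition(rows, key, num_partitions, sample_size=100):
    """Toy model of repartitionByRange: sample the keys, pick range
    boundaries from the sorted sample, and route each row to the
    partition whose key range contains it."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    keys = sorted(key(r) for r in sample)
    # Choose num_partitions - 1 boundary keys from the sorted sample.
    step = max(1, len(keys) // num_partitions)
    boundaries = keys[step::step][: num_partitions - 1]
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:
        # bisect_right finds the partition whose range contains key(r).
        partitions[bisect.bisect_right(boundaries, key(r))].append(r)
    return partitions
```

With data laid out this way, a point lookup can skip every partition except one, which mirrors the benchmark above: the default layout reads 959.2 GB while the range-partitioned layout reads only 17.4 GB.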