GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/21782
[SPARK-24816][SQL] SQL interface support repartitionByRange
## What changes were proposed in this pull request?
Add SQL interface support for `repartitionByRange` to improve data pushdown.
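The Dataset API already exposes `repartitionByRange`; this change adds an equivalent clause to the SQL dialect. A minimal sketch of the intended usage, assuming the clause ordering implied by the benchmark mode labels below (table names are illustrative; the exact grammar is defined by this PR's parser change):
```sql
-- Hypothetical usage of the proposed clause, mirroring the
-- "RANGE PARTITION BY + SORT BY" benchmark mode below.
-- t_range and source_table are illustrative names.
INSERT OVERWRITE TABLE t_range
SELECT * FROM source_table
RANGE PARTITION BY id
SORT BY id
```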
## How was this patch tested?
Manual tests and benchmark tests.
I tested this feature with a large table (data size: 1.1 TB, row count: 282,001,954,428).
The test SQL:
```sql
select * from table where id=401564838907
```
The test result:
| Mode | Input Size | Records | Total Time | Duration | Prepare Data Resource Allocation (MB-seconds) |
| -- | -- | -- | -- | -- | -- |
| default | 959.2 GB | 237,624,395,522 | 11.2 h | 1.3 min | 6,496,280,086 |
| DISTRIBUTE BY | 970.8 GB | 244,642,791,213 | 11.4 h | 1.3 min | 10,536,069,846 |
| SORT BY | 456.3 GB | 101,587,838,784 | 5.4 h | 31 s | 8,965,158,620 |
| DISTRIBUTE BY + SORT BY | 219.0 GB | 51,723,521,593 | 3.3 h | 54 s | 12,552,656,774 |
| RANGE PARTITION BY | 38.5 GB | 75,355,144 | 45 min | 13 s | 14,525,275,297 |
| RANGE PARTITION BY + SORT BY | 17.4 GB | 14,334,724 | 45 min | 12 s | 16,255,296,698 |
It's obvious that after the data is repartitioned by `RANGE PARTITION BY id
SORT BY id`, only 17.4 GB of data needs to be read to filter `id=401564838907`, and
performance is greatly improved: range partitioning clusters nearby `id` values into the
same files, so a point filter only touches the few files whose min/max statistics cover
the target value.
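For comparison, a sketch of the strongest pre-existing alternative from the table above, which uses hash distribution (again with illustrative table names):
```sql
-- Baseline preparation with existing clauses (the "DISTRIBUTE BY + SORT BY"
-- mode above). DISTRIBUTE BY hash-partitions rows by id, so adjacent ids are
-- scattered across files and min/max-based file skipping is far less
-- effective than with range partitioning.
INSERT OVERWRITE TABLE t_baseline
SELECT * FROM source_table
DISTRIBUTE BY id
SORT BY id
```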
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-24816
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21782.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21782
----
commit 3e3bbada6e9b02f4fd5d8db216bdc2ce4a397d12
Author: Yuming Wang <yumwang@...>
Date: 2018-07-16T09:06:20Z
SQL interface support RANGE PARTITION
----