[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

jinxing64 Sat, 02 Dec 2017 23:11:07 -0800

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/19868


    [SPARK-22676] Avoid iterating all partition paths when 
spark.sql.hive.verifyPartitionPath=true

    ## What changes were proposed in this pull request?
    
    The current code scans every partition path of a table when spark.sql.hive.verifyPartitionPath=true.
    For example, consider a table like the one below:
    ```
    CREATE TABLE `test`(
    `id` int,
    `age` int,
    `name` string)
    PARTITIONED BY (
    `A` string,
    `B` string);
    load data local inpath '/tmp/data0' into table test partition(A='00', B='00');
    load data local inpath '/tmp/data1' into table test partition(A='01', B='01');
    load data local inpath '/tmp/data2' into table test partition(A='10', B='10');
    load data local inpath '/tmp/data3' into table test partition(A='11', B='11');
    ```
    If I query with SQL: "select * from test where year=2017 and month=12 and day=03", the current code scans all partition paths, including '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', and '/data/A=11/B=11'. This costs considerable time and memory.
    
    This PR proposes to avoid iterating over all partition paths by converting /demo/data/year/month/day to /demo/data/year/month/*/ when generating the path pattern.
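
As a rough illustration of the conversion described above, the following Python sketch builds a pattern that wildcards only the last path component and then checks which of the example partition directories it matches. It is not the code from the patch (Spark itself is written in Scala): the helper names `last_level_wildcard_pattern` and `matches` are hypothetical, and the `/data` layout is assumed from the example paths.

```python
import fnmatch
import posixpath


def last_level_wildcard_pattern(partition_path: str) -> str:
    """Replace only the last path component with '*', keeping the parents.

    Hypothetical helper for illustration; not the function from the patch.
    """
    parent = posixpath.dirname(partition_path.rstrip("/"))
    return parent + "/*/"


def matches(pattern: str, path: str) -> bool:
    """Match component by component so that '*' never crosses a '/'."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    return len(pattern_parts) == len(path_parts) and all(
        fnmatch.fnmatchcase(part, pat)
        for pat, part in zip(pattern_parts, path_parts)
    )


# Partition directories from the example table (assumed layout under /data).
partition_paths = [
    "/data/A=00/B=00",
    "/data/A=01/B=01",
    "/data/A=10/B=10",
    "/data/A=11/B=11",
]

pattern = last_level_wildcard_pattern("/data/A=00/B=00")
candidates = [p for p in partition_paths if matches(pattern, p)]
print(pattern)
print(candidates)
```

With this pattern, only directories that share the parent '/data/A=00' are candidates for the existence check, so the other three partition paths are never listed.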
    
    ## How was this patch tested?
    
    Manually tested.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22676

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19868
    
----
commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing <jinxing6...@126.com>
Date:   2017-12-03T07:07:12Z

    [SPARK-22676] Avoid iterating all partition paths when 
spark.sql.hive.verifyPartitionPath=true

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
