GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/19868
[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

## What changes were proposed in this pull request?

The current code scans all partition paths when spark.sql.hive.verifyPartitionPath=true. For example, given a table like the one below:

```
CREATE TABLE `test`(
  `id` int,
  `age` int,
  `name` string)
PARTITIONED BY (
  `A` string,
  `B` string)

load data local inpath '/tmp/data0' into table test partition(A='00', B='00')
load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
load data local inpath '/tmp/data2' into table test partition(A='10', B='10')
load data local inpath '/tmp/data3' into table test partition(A='11', B='11')
```

If I query with SQL "select * from test where year=2017 and month=12 and day=03", the current code will scan all partition paths, including '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', and '/data/A=11/B=11'. This costs a lot of time and memory.

This PR proposes to avoid iterating over all partition paths: convert /demo/data/year/month/day to /demo/data/year/month/*/ when generating the path pattern.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22676

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19868

----
commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing <jinxing6...@126.com>
Date:   2017-12-03T07:07:12Z

    [SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

----

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
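
A minimal sketch of the path-pattern conversion described in the pull request, written in Python for illustration only and not taken from the patch. The function name `to_path_pattern` and its `num_wildcards` parameter are hypothetical; the actual change is in Spark's Scala code and may differ in how it decides which directory levels to wildcard.

```python
def to_path_pattern(partition_path: str, num_wildcards: int) -> str:
    """Replace the last `num_wildcards` directory levels of a path with '*'.

    Hypothetical helper: it only illustrates the string rewrite
    /demo/data/year/month/day -> /demo/data/year/month/*/
    """
    parts = partition_path.rstrip("/").split("/")
    # Count only real directory names (a leading '/' yields an empty first part).
    depth = len([p for p in parts if p])
    if not 0 <= num_wildcards <= depth:
        raise ValueError("num_wildcards must be between 0 and the path depth")
    kept = parts[: len(parts) - num_wildcards]
    return "/".join(kept + ["*"] * num_wildcards) + "/"


# Wildcard only the last level, as in the example from the description.
print(to_path_pattern("/demo/data/year/month/day", 1))  # /demo/data/year/month/*/
# Wildcard every partition level, which matches all partition directories.
print(to_path_pattern("/demo/data/year/month/day", 3))  # /demo/data/*/*/*/
```

The fewer levels are replaced by `*`, the fewer directories a glob over the resulting pattern has to list, which is the saving the pull request is after.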