GitHub user jinxing64 opened a pull request:
https://github.com/apache/spark/pull/19868
[SPARK-22676] Avoid iterating all partition paths when
spark.sql.hive.verifyPartitionPath=true
## What changes were proposed in this pull request?
The current code scans all partition paths when
spark.sql.hive.verifyPartitionPath=true.
For example, given a table like the one below:
```
CREATE TABLE `test`(
`id` int,
`age` int,
`name` string)
PARTITIONED BY (
`A` string,
`B` string);
load data local inpath '/tmp/data0' into table test partition(A='00', B='00');
load data local inpath '/tmp/data1' into table test partition(A='01', B='01');
load data local inpath '/tmp/data2' into table test partition(A='10', B='10');
load data local inpath '/tmp/data3' into table test partition(A='11', B='11');
```
If I query with SQL "select * from test where year=2017 and month=12
and day=03", the current code will scan all partition paths, including
'/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', and
'/data/A=11/B=11'. This costs a lot of time and memory.
This PR proposes to avoid iterating over all partition paths by converting
/demo/data/year/month/day to /demo/data/year/month/*/ when generating the
path pattern.
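As a rough illustration of the idea, and not the actual Spark code in this patch, the sketch below builds a glob pattern from a partition path by replacing only the trailing directory levels with `*`. The function name `path_pattern` and the parameter `wildcard_levels` are hypothetical and exist only for this example.

```python
def path_pattern(partition_path: str, wildcard_levels: int) -> str:
    """Replace the last `wildcard_levels` directory levels with '*'.

    Hypothetical helper for illustration only; it does not mirror any
    Spark API.
    """
    parts = partition_path.rstrip("/").split("/")
    if not 0 <= wildcard_levels < len(parts):
        raise ValueError("wildcard_levels is out of range for this path")
    keep = len(parts) - wildcard_levels
    # Keep the leading components as-is, wildcard the rest.
    return "/".join(parts[:keep] + ["*"] * wildcard_levels)


# Wildcarding every partition level matches every partition directory.
broad = path_pattern("/demo/data/year/month/day", 3)

# Wildcarding only the last level keeps the pattern narrow.
narrow = path_pattern("/demo/data/year/month/day", 1)
```

Here `broad` is `/demo/data/*/*/*`, which a file system would have to expand across every partition directory, while `narrow` is `/demo/data/year/month/*`, which only lists the children of one `month` directory.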
## How was this patch tested?
Tested manually.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jinxing64/spark SPARK-22676
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19868.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19868
----
commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing <[email protected]>
Date: 2017-12-03T07:07:12Z
[SPARK-22676] Avoid iterating all partition paths when
spark.sql.hive.verifyPartitionPath=true
----