GitHub user gengliangwang opened a pull request:
https://github.com/apache/spark/pull/21004
[SPARK-23896][SQL] Improve PartitioningAwareFileIndex
## What changes were proposed in this pull request?
Currently, `PartitioningAwareFileIndex` accepts an optional parameter
`userPartitionSchema`. If it is provided, the index combines the inferred
partition schema with that parameter.
However, producing that parameter is roundabout:
1. To get `userPartitionSchema`, we need to combine the inferred partition
schema with `userSpecifiedSchema`.
2. To get the inferred partition schema, we have to create a temporary file
index.
Only after both steps can the final `PartitioningAwareFileIndex` be
created.
This can be improved by passing `userSpecifiedSchema` to
`PartitioningAwareFileIndex`.
With this change, we can remove redundant code and avoid parsing the
file partitions twice.
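To illustrate the idea, here is a minimal sketch of how a file index that
receives `userSpecifiedSchema` directly could resolve its partition schema in
one pass. It is not the code in this patch: `Field`, `PartitionSchemaSketch`
and `resolvePartitionSchema` are hypothetical stand-ins, and the merge rule
(a user-specified type overrides the inferred type for a column with the same
name, compared case-insensitively) is an assumption made for the example.

```scala
object PartitionSchemaSketch {
  // Hypothetical, simplified stand-in for Spark's StructField (name and type only).
  final case class Field(name: String, dataType: String)

  // Assumed merge rule: each inferred partition column keeps its name, but takes
  // the user-specified type when the user schema declares a column with the same
  // name (case-insensitive). Columns the user did not mention stay as inferred.
  def resolvePartitionSchema(
      inferred: Seq[Field],
      userSpecifiedSchema: Option[Seq[Field]]): Seq[Field] = {
    val userFields = userSpecifiedSchema.getOrElse(Seq.empty)
    inferred.map { p =>
      userFields
        .find(_.name.equalsIgnoreCase(p.name))
        .map(u => p.copy(dataType = u.dataType))
        .getOrElse(p)
    }
  }
}
```

For example, with inferred partition columns `year: int` and
`country: string`, and a user schema that declares `id: long` and
`year: string`, the sketch yields `year: string` and `country: string`.
Because the index does this itself, the caller no longer needs a temporary
index just to compute `userPartitionSchema`.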
## How was this patch tested?
Unit test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gengliangwang/spark PartitioningAwareFileIndex
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21004.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21004
----
commit 35aff24743ff13ccd370a8e3747a3044e8a671c9
Author: Gengliang Wang <gengliang.wang@...>
Date: 2018-04-08T18:19:48Z
improve PartitioningAwareFileIndex
----