GitHub user xuanyuanking opened a pull request:
https://github.com/apache/spark/pull/21618
[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation
time
## What changes were proposed in this pull request?
This PR resolves glob paths in parallel, which speeds up resolution of
complex wildcard paths. The main changes are:
1. Add a new function getGlobbedPaths in DataSource that returns all paths
represented by the wildcard pattern. It uses a local thread pool when the
number of paths expanded from the pattern is larger than
`spark.sql.sources.parallelGetGlobbedPath.threshold`. The size of the local
thread pool is controlled by
`spark.sql.sources.parallelGetGlobbedPath.parallelism`.
2. Add a new function expandGlobPath in SparkHadoopUtil to expand the
directories represented by the pattern. It mainly reuses the logic in
org.apache.hadoop.fs.Globber.glob().
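
The threshold and parallelism behaviour described above can be sketched
roughly as follows. This is an illustration only, not the code in this PR:
`ParallelGlobSketch`, `resolveAll`, and the `expand` parameter are hypothetical
names, and `expand` stands in for whatever resolves a single pattern to
concrete paths (in Spark that would be a Hadoop FileSystem glob call). The
sketch assumes that pattern counts at or below the threshold are resolved
serially.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelGlobSketch {
  // Hypothetical helper, not the PR's implementation.
  // `expand` is an assumed stand-in that resolves one pattern to its paths.
  def resolveAll(
      patterns: Seq[String],
      threshold: Int,
      parallelism: Int)(expand: String => Seq[String]): Seq[String] = {
    if (patterns.size <= threshold) {
      // Few patterns: resolve them serially on the calling thread.
      patterns.flatMap(expand)
    } else {
      // Many patterns: fan out over a short-lived local thread pool.
      val pool = Executors.newFixedThreadPool(parallelism)
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(pool)
      try {
        val futures = patterns.map(p => Future(expand(p)))
        // Future.sequence keeps results in the order of `patterns`.
        Await.result(Future.sequence(futures), Duration.Inf).flatten
      } finally {
        // Always release the pool's threads, even if an expansion fails.
        pool.shutdown()
      }
    }
  }
}
```

Keeping the pool local to the call means no threads outlive the resolution
step, at the cost of creating a pool each time the threshold is exceeded.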
## How was this patch tested?
Added unit tests in SparkHadoopUtilSuite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuanyuanking/spark SPARK-20408
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21618
----
commit 12c9e8600ef922b1119939ae322467794ad39b86
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2017-09-30T09:43:30Z
Resolve conflicts with SPARK-21374
commit b552ed9caca45df71260bbb77a747f7e6d996df7
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2017-10-01T00:51:18Z
HadoopFileSystem not searializable, fix by passing SerializableConfiguration
commit eeb12d6d5c6f763f556280e7668b985091611fc8
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2017-11-13T09:04:31Z
Add UT for expandGlobPath
commit 484b5a54ac3b6d571098b4b03cb3a325e0628943
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2017-11-14T03:35:10Z
Fix ut for Seq not in expected order
commit 3c7fc22d752f7a3b0cda73e74f8a2a895e89d684
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2018-01-23T06:47:12Z
reimplement by using local thread pool
commit 1068aa92986e9d64686be007dddb11e52cc167ed
Author: Yuanjian Li <xyliyuanjian@...>
Date: 2018-06-23T05:26:22Z
Reimplement expandGlobPath method by SparkGlobber
----