GitHub user xuanyuanking opened a pull request:
https://github.com/apache/spark/pull/17702
[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation
time
## What changes were proposed in this pull request?
This PR resolves glob paths in parallel, which speeds up relation
resolution for complex wildcard paths. The main changes are:
1. Add a config named `spark.sql.globPathInParallel`, default `false`.
2. Add a new function `getGlobbedPaths` in `DataSource` that returns all
paths represented by the wildcard, in parallel or serially as controlled
by the config.
3. Add a new function `expandGlobPath` in `SparkHadoopUtil` that expands
the first directory represented by the wildcard.
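
To illustrate the idea behind changes 2 and 3, here is a minimal sketch in plain Python, not the Scala code in this patch. It assumes a local POSIX filesystem and uses the standard `glob` module in place of the Hadoop `FileSystem` API. The names `expand_first_glob_dir` and `get_globbed_paths` are hypothetical stand-ins that only mirror the roles of `expandGlobPath` and `getGlobbedPaths`; the `parallel` and `max_workers` parameters are likewise assumptions made for the example.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

GLOB_CHARS = set("*?[")


def has_glob(segment):
    """Return True if a single path component contains a wildcard character."""
    return any(c in GLOB_CHARS for c in segment)


def expand_first_glob_dir(pattern):
    """Expand only the first wildcard directory, returning narrower patterns.

    If the first wildcard is in the last path component (or there is no
    wildcard at all), the pattern is returned unchanged in a one-element list.
    """
    parts = pattern.split(os.sep)
    for i, part in enumerate(parts):
        if has_glob(part):
            rest = parts[i + 1:]
            if not rest:
                return [pattern]
            prefix = os.sep.join(parts[: i + 1])
            # Only directories can contain the remaining components.
            dirs = sorted(p for p in glob.glob(prefix) if os.path.isdir(p))
            return [os.path.join(d, *rest) for d in dirs]
    return [pattern]


def get_globbed_paths(pattern, parallel=False, max_workers=8):
    """Return all paths matching `pattern`, optionally globbing in parallel."""
    sub_patterns = expand_first_glob_dir(pattern)
    if parallel and len(sub_patterns) > 1:
        # Each narrower pattern is resolved on its own worker thread.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            groups = list(pool.map(glob.glob, sub_patterns))
    else:
        groups = [glob.glob(p) for p in sub_patterns]
    return sorted(path for group in groups for path in group)
```

In this sketch, both modes return the same sorted list of paths; the parallel mode only changes how the narrower patterns are listed. Any speedup depends on how expensive each listing call is, so it matters most on remote file systems where listing latency dominates.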
## How was this patch tested?
Existing unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuanyuanking/spark SPARK-20408
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17702.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17702
----
commit b27ef4f9e696e2b2c2fc2e0df504baea88937234
Author: xuanyuanking <[email protected]>
Date: 2017-04-20T11:07:47Z
[SPARK-20408][SQL]Get the glob path in parallel to reduce resolve relation
time
----