GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/21618
[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation time

## What changes were proposed in this pull request?

This PR changes glob path resolution to run in parallel, which makes complex wildcard paths resolve more quickly. The main changes in detail:

1. Add a new function `getGlobbedPaths` in `DataSource` that returns all paths represented by the wildcard pattern. It uses a local thread pool when the number of paths expanded from the pattern is larger than `spark.sql.sources.parallelGetGlobbedPath.threshold`. The size of the local thread pool is controlled by `spark.sql.sources.parallelGetGlobbedPath.parallelism`.
2. Add a new function `expandGlobPath` in `SparkHadoopUtil` to expand the directory represented by the pattern. It mainly reuses the logic in `org.apache.hadoop.fs.Globber.glob()`.

## How was this patch tested?

Added a UT in SparkHadoopUtilSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-20408

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21618

----
commit 12c9e8600ef922b1119939ae322467794ad39b86
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-09-30T09:43:30Z

    Resolve conflicts with SPARK-21374

commit b552ed9caca45df71260bbb77a747f7e6d996df7
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-10-01T00:51:18Z

    HadoopFileSystem not serializable, fix by passing SerializableConfiguration

commit eeb12d6d5c6f763f556280e7668b985091611fc8
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-11-13T09:04:31Z

    Add UT for expandGlobPath

commit 484b5a54ac3b6d571098b4b03cb3a325e0628943
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-11-14T03:35:10Z

    Fix ut for Seq not in expected order

commit 3c7fc22d752f7a3b0cda73e74f8a2a895e89d684
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2018-01-23T06:47:12Z

    reimplement by using local thread pool

commit 1068aa92986e9d64686be007dddb11e52cc167ed
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2018-06-23T05:26:22Z

    Reimplement expandGlobPath method by SparkGlobber

----
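
For readers who want a feel for the threshold-gated approach described under "What changes were proposed", here is a minimal sketch in Python. It is not the code in this PR: Python's `glob` module stands in for Hadoop's globber, and `expand_first_wildcard`, `get_globbed_paths`, `PARALLEL_THRESHOLD`, and `PARALLELISM` are hypothetical names loosely mirroring `expandGlobPath`, `getGlobbedPaths`, and the two configuration keys. The constant values are illustrative, not Spark defaults.

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two settings named in the PR description.
# The values are illustrative only.
PARALLEL_THRESHOLD = 4
PARALLELISM = 8


def expand_first_wildcard(pattern):
    """Resolve only the first wildcard component of `pattern`.

    Returns one pattern per matching entry, with the remaining
    components appended unexpanded. A pattern without wildcards is
    returned unchanged as a single-element list.
    """
    parts = pattern.split(os.sep)
    for i, part in enumerate(parts):
        if any(ch in part for ch in "*?["):
            head = os.sep.join(parts[: i + 1])
            tail = parts[i + 1:]
            matches = sorted(glob.glob(head))
            return [os.path.join(m, *tail) if tail else m for m in matches]
    return [pattern]


def get_globbed_paths(pattern, threshold=PARALLEL_THRESHOLD,
                      parallelism=PARALLELISM):
    """Return every path matching `pattern`, sorted.

    The pattern is first expanded one level. If that yields more than
    `threshold` sub-patterns, they are globbed on a local thread pool;
    otherwise they are globbed sequentially on the calling thread.
    """
    expanded = expand_first_wildcard(pattern)
    if len(expanded) <= threshold:
        groups = [glob.glob(p) for p in expanded]
    else:
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            groups = list(pool.map(glob.glob, expanded))
    return sorted(path for group in groups for path in group)


if __name__ == "__main__":
    # Build a small partitioned layout and resolve it both ways.
    with tempfile.TemporaryDirectory() as root:
        for day in range(6):
            os.makedirs(os.path.join(root, f"day={day}"))
            for part in range(2):
                path = os.path.join(root, f"day={day}", f"part-{part}.txt")
                open(path, "w").close()
        pattern = os.path.join(root, "day=*", "part-*.txt")
        sequential = get_globbed_paths(pattern, threshold=100)
        parallel = get_globbed_paths(pattern, threshold=1)
        print(len(sequential), sequential == parallel)
```

The threshold keeps small lookups on the calling thread, where the cost of starting a pool would likely outweigh any gain. The pool is local to the call and is shut down when the `with` block exits.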