[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

xuanyuanking Fri, 22 Jun 2018 22:34:06 -0700

GitHub user xuanyuanking opened a pull request:

    https://github.com/apache/spark/pull/21618


    [SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation time

    ## What changes were proposed in this pull request?
    
    This PR resolves glob paths in parallel, which speeds up relation
    resolution for complex wildcard paths. The main changes are:
    1. Add a new function getGlobbedPaths in DataSource that returns all paths
    represented by the wildcard pattern. It uses a local thread pool when the
    number of paths expanded from the pattern is larger than
    `spark.sql.sources.parallelGetGlobbedPath.threshold`. The size of the local
    thread pool is controlled by
    `spark.sql.sources.parallelGetGlobbedPath.parallelism`.
    2. Add a new function expandGlobPath in SparkHadoopUtil that expands the
    directories represented by the pattern. It mainly reuses the logic in
    org.apache.hadoop.fs.Globber.glob().
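
The two steps described above can be illustrated with a minimal sketch. It is not the code in this PR: it assumes a local filesystem and uses only `java.nio.file` instead of Hadoop's FileSystem and Globber, and the class name `ParallelGlobSketch`, its methods, and the two-level `<dirGlob>/<fileGlob>` pattern shape are hypothetical. The `threshold` and `parallelism` parameters stand in for the two configuration values named in the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: expand the directory level first, then glob each
// expanded directory, in parallel only when there are more than `threshold`.
final class ParallelGlobSketch {

    // List the entries of `dir` whose file name matches a single-level glob.
    static List<Path> listMatching(Path dir, String glob) {
        List<Path> out = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                out.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Resolve "<dirGlob>/<fileGlob>" under `root`.
    static List<Path> getGlobbedPaths(Path root, String dirGlob, String fileGlob,
                                      int threshold, int parallelism) {
        // Step 1: expand the directory part of the pattern sequentially.
        List<Path> dirs = new ArrayList<>();
        for (Path p : listMatching(root, dirGlob)) {
            if (Files.isDirectory(p)) {
                dirs.add(p);
            }
        }
        List<Path> result = new ArrayList<>();
        if (dirs.size() <= threshold) {
            // Few expanded paths: a thread pool is not worth its overhead.
            for (Path d : dirs) {
                result.addAll(listMatching(d, fileGlob));
            }
            return result;
        }
        // Step 2: many expanded paths, so glob them with a local thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<Path>>> futures = new ArrayList<>();
            for (Path d : dirs) {
                Callable<List<Path>> task = () -> listMatching(d, fileGlob);
                futures.add(pool.submit(task));
            }
            for (Future<List<Path>> f : futures) {
                result.addAll(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // the pool is local to this call
        }
        return result;
    }

    // Build a small temporary tree (a, b, c, each with x.txt and y.log) and
    // return the sorted "dir/file" names matching "*/*.txt".
    // The temporary directory is left behind; this is only a demo.
    static List<String> demo(int threshold) {
        try {
            Path root = Files.createTempDirectory("glob-sketch");
            for (String name : new String[] {"a", "b", "c"}) {
                Path dir = Files.createDirectory(root.resolve(name));
                Files.createFile(dir.resolve("x.txt"));
                Files.createFile(dir.resolve("y.log"));
            }
            List<String> names = new ArrayList<>();
            for (Path p : getGlobbedPaths(root, "*", "*.txt", threshold, 2)) {
                names.add(p.getParent().getFileName() + "/" + p.getFileName());
            }
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With three expanded directories, `demo(2)` takes the thread-pool branch and `demo(10)` takes the sequential branch; both should return the same sorted list of matches. The result is sorted because the order in which a directory listing returns entries is not guaranteed.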
    
    ## How was this patch tested?
    
    Added unit tests in SparkHadoopUtilSuite.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-20408

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21618
    
----
commit 12c9e8600ef922b1119939ae322467794ad39b86
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-09-30T09:43:30Z

    Resolve conflicts with SPARK-21374

commit b552ed9caca45df71260bbb77a747f7e6d996df7
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-10-01T00:51:18Z

    HadoopFileSystem not searializable, fix by passing SerializableConfiguration

commit eeb12d6d5c6f763f556280e7668b985091611fc8
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-11-13T09:04:31Z

    Add UT for expandGlobPath

commit 484b5a54ac3b6d571098b4b03cb3a325e0628943
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2017-11-14T03:35:10Z

    Fix ut for Seq not in expected order

commit 3c7fc22d752f7a3b0cda73e74f8a2a895e89d684
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2018-01-23T06:47:12Z

    reimplement by using local thread pool

commit 1068aa92986e9d64686be007dddb11e52cc167ed
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2018-06-23T05:26:22Z

    Reimplement expandGlobPath method by SparkGlobber

----


---

