Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17702#discussion_r156256317

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -668,4 +672,31 @@ object DataSource extends Logging {
     }
     globPath
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
+   */
+  private def getGlobbedPaths(
+      sparkSession: SparkSession,
+      fs: FileSystem,
+      hadoopConf: SerializableConfiguration,
+      qualified: Path): Seq[Path] = {
+    val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified)
+    if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+      SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
+    } else {
+      val parallelPartitionDiscoveryParallelism =
+        sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
+      val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
+      val expanded = sparkSession.sparkContext
--- End diff --

@vanzin Thanks for your reply.

```
Why do this using a Spark job, instead of just a local thread pool?
```

Since the DFS is generally deployed alongside the NodeManagers for better data locality, while in client mode the driver may run in a different region from the cluster, running the listing as a Spark job keeps the filesystem calls on the executors and avoids the cross-region interaction in our scenario.
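The diff above cuts off right where the job is built from `sparkSession.sparkContext`, so for readers following the thread, here is a minimal sketch of how such a job could spread the glob expansion across executors. The helper name `globInParallel` and the job body are illustrative assumptions, not the PR's actual code; note also that `SerializableConfiguration` is Spark-internal, so a real version would have to live under `org.apache.spark`, as `DataSource.scala` does.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

object GlobHelper {
  // Hypothetical sketch: expand each glob pattern on the executors
  // instead of on the driver.
  def globInParallel(
      sparkSession: SparkSession,
      paths: Seq[Path],
      hadoopConf: SerializableConfiguration,
      numParallelism: Int): Seq[Path] = {
    // Hadoop Path is not serializable, so ship the paths as strings.
    val serializedPaths = paths.map(_.toString)
    sparkSession.sparkContext
      .parallelize(serializedPaths, numParallelism)
      .mapPartitions { pathStrings =>
        pathStrings.flatMap { pathString =>
          // Each executor opens its own FileSystem from the shipped conf
          // and issues the glob RPCs locally, so a driver running in a
          // different region never talks to the DFS directly.
          val path = new Path(pathString)
          val fs = path.getFileSystem(hadoopConf.value)
          Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath.toString)
        }
      }
      .collect()
      .map(new Path(_))
      .toSeq
  }
}
```

Compared with a local thread pool, the trade-off is the overhead of scheduling a Spark job versus keeping the listing traffic off the driver entirely, which is exactly the cross-region concern described above.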