Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17702#discussion_r156265041
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -668,4 +672,31 @@ object DataSource extends Logging {
         }
         globPath
       }
    +
    +  /**
    +   * Return all paths represented by the wildcard string.
    +   * Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
    +   */
    +  private def getGlobbedPaths(
    +      sparkSession: SparkSession,
    +      fs: FileSystem,
    +      hadoopConf: SerializableConfiguration,
    +      qualified: Path): Seq[Path] = {
    +    val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified)
    +    if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
    +      SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
    +    } else {
    +      val parallelPartitionDiscoveryParallelism =
    +        sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
    +      val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
    +      val expanded = sparkSession.sparkContext
    --- End diff --
    
    Yep, I mean YARN and HDFS are always deployed in the same region, but we can't control where the driver runs, because in client mode (e.g. spark-sql or spark-shell) it is the customer's machine.
    For example, we deploy YARN and HDFS in Beijing, CN, while a user runs Spark SQL from Shanghai, CN.
    Maybe this scenario shouldn't be considered in this patch? What's your opinion @vanzin
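    To make the discussion concrete, here is a rough sketch of what the executor-side expansion (the part cut off after `val expanded = sparkSession.sparkContext` in the diff above) could look like. This is only an illustration under my assumptions, not the exact code in this PR: the helper name `expandOnExecutors` is made up, and only `expandGlobPath` / `globPathIfNecessary` and the threshold logic come from the diff.

    ```scala
    // Sketch only: expand the remaining globs on executors instead of the driver,
    // so listing happens close to HDFS even when the driver runs far away.
    import org.apache.hadoop.fs.Path
    import org.apache.spark.SparkContext
    import org.apache.spark.util.SerializableConfiguration

    // Hypothetical helper, not part of the PR.
    def expandOnExecutors(
        sc: SparkContext,
        hadoopConf: SerializableConfiguration,
        paths: Seq[Path],
        numParallelism: Int): Seq[Path] = {
      // Ship path strings (Path itself is not serializable) to the executors.
      sc.parallelize(paths.map(_.toString), numParallelism)
        .mapPartitions { iter =>
          iter.flatMap { pathString =>
            val path = new Path(pathString)
            // Re-create the FileSystem from the serialized Hadoop conf on the executor.
            val fs = path.getFileSystem(hadoopConf.value)
            // globStatus returns null when nothing matches the pattern.
            Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath.toString)
          }
        }
        .collect()
        .map(new Path(_))
        .toSeq
    }
    ```

    The threshold check in `getGlobbedPaths` would then decide between the driver-side `globPathIfNecessary` call and something like the executor-side expansion above.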

