Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/17702#discussion_r156438780
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
---
@@ -668,4 +672,31 @@ object DataSource extends Logging {
}
globPath
}
+
+ /**
+ * Return all paths represented by the wildcard string.
+ * Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
+ */
+ private def getGlobbedPaths(
+ sparkSession: SparkSession,
+ fs: FileSystem,
+ hadoopConf: SerializableConfiguration,
+ qualified: Path): Seq[Path] = {
+ val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified)
+ if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+ SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
+ } else {
+ val parallelPartitionDiscoveryParallelism =
+ sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
+ val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
+ val expanded = sparkSession.sparkContext
--- End diff --
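The clamping logic in the snippet above can be sketched as a standalone helper. This is an illustrative sketch, not the PR's code; the object and parameter names are hypothetical:

```scala
// Hedged sketch of the clamping in the diff above: the number of parallel
// tasks is capped at the configured parallelism, but never exceeds the
// number of paths to expand (no point scheduling more tasks than paths).
object GlobParallelism {
  def numParallelism(numPaths: Int, configuredParallelism: Int): Int =
    math.min(numPaths, configuredParallelism)
}
```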
Where the driver is deployed is not a concern here; first because, as I said,
this is not about latency, but about parallelizing multiple calls to the NN.
Second, because if your driver is on a different network, it will have the same
latency when talking to the executors as it would talking directly to the NN,
so you're not fixing anything, you're just adding an extra hop (= more latency).
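The point about parallelizing multiple NN calls (rather than reducing per-call latency) can be sketched with plain Scala futures. This is an illustration, not the PR's Spark-based implementation; `listStatus` stands in for a NameNode RPC, and all names are hypothetical:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hedged sketch: fan out independent metadata lookups so that N round trips
// cost roughly one round trip of wall-clock time. Each call still has the
// same latency; only the total elapsed time shrinks.
object ParallelGlob {
  def expandAll(paths: Seq[String])(listStatus: String => Seq[String]): Seq[String] = {
    val futures = paths.map(p => Future(listStatus(p))) // one in-flight call per path
    Await.result(Future.sequence(futures), 1.minute).flatten // results keep input order
  }
}
```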
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]