Github user xuanyuanking commented on a diff in the pull request:
https://github.com/apache/spark/pull/21618#discussion_r197836518
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
---
@@ -724,4 +726,35 @@ object DataSource extends Logging {
""".stripMargin)
}
}
+
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * Use a local thread pool to do this when there are too many paths.
+   */
+  private def getGlobbedPaths(
+      sparkSession: SparkSession,
+      fs: FileSystem,
+      hadoopConf: Configuration,
+      qualified: Path): Seq[Path] = {
+    val getGlobbedPathThreshold =
+      sparkSession.sessionState.conf.parallelGetGlobbedPathThreshold
+    val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified,
+      getGlobbedPathThreshold)
--- End diff ---
em... It's hard to achieve this in the current implementation. Maybe we
could use a boolean config like `spark.sql.sources.parallelGetGlobbedPath.enable`
to switch between this and the previous code path? That might be safer, given
your concerns.
---
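For context, a minimal sketch of the boolean-gate idea discussed above. This is illustrative only: the conf accessor `parallelGetGlobbedPathEnabled` and the sequential fallback `globPathIfNecessary` are assumed names following the comment's suggestion and may not match the final implementation.

```scala
// Sketch: gate the new parallel glob expansion behind a boolean SQL conf,
// falling back to the previous sequential code path when disabled.
private def getGlobbedPaths(
    sparkSession: SparkSession,
    fs: FileSystem,
    hadoopConf: Configuration,
    qualified: Path): Seq[Path] = {
  if (sparkSession.sessionState.conf.parallelGetGlobbedPathEnabled) {
    // New path: expand the glob with a local thread pool once the number
    // of candidate paths exceeds the configured threshold.
    val threshold =
      sparkSession.sessionState.conf.parallelGetGlobbedPathThreshold
    SparkHadoopUtil.get.expandGlobPath(fs, qualified, threshold)
  } else {
    // Previous behavior: a single sequential globStatus-based expansion.
    SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
  }
}
```

Gating new behavior behind an off-by-default flag is a common pattern in Spark for de-risking changes to hot code paths.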
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]