Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17702#discussion_r156256317

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -668,4 +672,31 @@ object DataSource extends Logging {
     }
     globPath
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
+   */
+  private def getGlobbedPaths(
+      sparkSession: SparkSession,
+      fs: FileSystem,
+      hadoopConf: SerializableConfiguration,
+      qualified: Path): Seq[Path] = {
+    val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified)
+    if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+      SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
+    } else {
+      val parallelPartitionDiscoveryParallelism =
+        sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
+      val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
+      val expanded = sparkSession.sparkContext
--- End diff --

@vanzin Thanks for your reply.

```
Why do this using a Spark job, instead of just a local thread pool?
```

Since the DFS is generally deployed alongside the NodeManagers for better data locality, while in client mode the driver may run in a different region from the cluster, running the listing as a Spark job keeps the filesystem calls on the executors and avoids the cross-region interaction in our scenario.
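The diff above cuts off right where the job is built from `sparkSession.sparkContext`, so for readers following the thread, here is a minimal sketch of how such a job could spread the glob expansion across executors. The helper name `globInParallel` and the job body are illustrative assumptions, not the PR's actual code; note also that `SerializableConfiguration` is Spark-internal, so a real version would have to live under `org.apache.spark`, as `DataSource.scala` does.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

object GlobHelper {
  // Hypothetical sketch: expand each glob pattern on the executors
  // instead of on the driver.
  def globInParallel(
      sparkSession: SparkSession,
      paths: Seq[Path],
      hadoopConf: SerializableConfiguration,
      numParallelism: Int): Seq[Path] = {
    // Hadoop Path is not serializable, so ship the paths as strings.
    val serializedPaths = paths.map(_.toString)
    sparkSession.sparkContext
      .parallelize(serializedPaths, numParallelism)
      .mapPartitions { pathStrings =>
        pathStrings.flatMap { pathString =>
          // Each executor opens its own FileSystem from the shipped conf
          // and issues the glob RPCs locally, so a driver running in a
          // different region never talks to the DFS directly.
          val path = new Path(pathString)
          val fs = path.getFileSystem(hadoopConf.value)
          Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath.toString)
        }
      }
      .collect()
      .map(new Path(_))
      .toSeq
  }
}
```

Compared with a local thread pool, the trade-off is the overhead of scheduling a Spark job versus keeping the listing traffic off the driver entirely, which is exactly the cross-region concern described above.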