Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/17702#discussion_r156438780
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
---
@@ -668,4 +672,31 @@ object DataSource extends Logging {
}
globPath
}
+
+ /**
+ * Return all paths represented by the wildcard string.
+ * Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
+ */
+ private def getGlobbedPaths(
+ sparkSession: SparkSession,
+ fs: FileSystem,
+ hadoopConf: SerializableConfiguration,
+ qualified: Path): Seq[Path] = {
+ val paths = SparkHadoopUtil.get.expandGlobPath(fs, qualified)
+ if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+ SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
+ } else {
+ val parallelPartitionDiscoveryParallelism =
+ sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
+ val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
+ val expanded = sparkSession.sparkContext
--- End diff --
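The clamping logic in the snippet above can be sketched as a standalone helper. This is an illustrative sketch, not the PR's code; the object and parameter names are hypothetical:

```scala
// Hedged sketch of the clamping in the diff above: the number of parallel
// tasks is capped at the configured parallelism, but never exceeds the
// number of paths to expand (no point scheduling more tasks than paths).
object GlobParallelism {
  def numParallelism(numPaths: Int, configuredParallelism: Int): Int =
    math.min(numPaths, configuredParallelism)
}
```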
Where the driver is deployed is not a concern here; first because, as I said,
this is not about latency, but about parallelizing multiple calls to the NN.
Second, because if your driver is on a different network, it will have the same
latency when talking to the executors as it would talking directly to the NN,
so you're not fixing anything, you're just adding an extra hop (= more latency).
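The point about parallelizing multiple NN calls (rather than reducing per-call latency) can be sketched with plain Scala futures. This is an illustration, not the PR's Spark-based implementation; `listStatus` stands in for a NameNode RPC, and all names are hypothetical:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hedged sketch: fan out independent metadata lookups so that N round trips
// cost roughly one round trip of wall-clock time. Each call still has the
// same latency; only the total elapsed time shrinks.
object ParallelGlob {
  def expandAll(paths: Seq[String])(listStatus: String => Seq[String]): Seq[String] = {
    val futures = paths.map(p => Future(listStatus(p))) // one in-flight call per path
    Await.result(Future.sequence(futures), 1.minute).flatten // results keep input order
  }
}
```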
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]