Github user habren commented on a diff in the pull request:
https://github.com/apache/spark/pull/22018#discussion_r208788059
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
---
@@ -297,7 +297,7 @@ object InMemoryFileIndex extends Logging {
val missingFiles = mutable.ArrayBuffer.empty[String]
val filteredLeafStatuses = allLeafStatuses.filterNot(
status => shouldFilterOut(status.getPath.getName))
- val resolvedLeafStatuses = filteredLeafStatuses.flatMap {
+ val resolvedLeafStatuses = filteredLeafStatuses.par.flatMap {
--- End diff --
Thanks @maropu for your comments. I updated the title and description.
Let me explain the difference between this change and the current parallel
partition discovery. The current one discovers different partitions in
parallel. This change gets the block locations of the files within a single
partition in parallel. When there are only a few partitions and each contains
tens of thousands of files, the current partition discovery won't help, whereas
this change can speed up the listing in that case.
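
To illustrate the point, here is a minimal sketch of a sequential versus a `.par` lookup over the files of one partition. It is not the Spark code: `FileEntry` and `lookupLocations` are hypothetical stand-ins for the file status and the block-location call, and it assumes Scala 2.11/2.12, where `.par` is part of the standard library.

```scala
// Sketch only. Assumes Scala 2.11/2.12; on 2.13+ `.par` needs the separate
// scala-parallel-collections module and
// `import scala.collection.parallel.CollectionConverters._`.
object BlockLocationSketch {
  // Hypothetical stand-in for a file status (not Hadoop's FileStatus).
  final case class FileEntry(path: String, length: Long)

  // Hypothetical stand-in for a slow, per-file block-location lookup.
  def lookupLocations(file: FileEntry): Seq[String] = {
    Thread.sleep(5) // simulated round trip to the name node
    Seq(s"host-${file.length % 3}")
  }

  // One lookup after another: total time grows linearly with the file count.
  def resolveSequential(files: Seq[FileEntry]): Seq[(String, Seq[String])] =
    files.flatMap { f =>
      if (f.length > 0) Seq(f.path -> lookupLocations(f)) else Seq.empty
    }

  // Same logic, but the lookups for the files of a single partition are
  // spread over the default parallel-collections thread pool.
  // A parallel sequence keeps the input order in its result.
  def resolveParallel(files: Seq[FileEntry]): Seq[(String, Seq[String])] =
    files.par.flatMap { f =>
      if (f.length > 0) Seq(f.path -> lookupLocations(f)) else Seq.empty
    }.toVector
}
```

Both methods return the same result; only the parallel one overlaps the lookups, which is where a partition with very many files would benefit.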