[GitHub] spark pull request #22018: [SPARK-25038][SQL] Get block location in parallel

habren Wed, 08 Aug 2018 19:03:51 -0700

Github user habren commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22018#discussion_r208788059
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala ---
    @@ -297,7 +297,7 @@ object InMemoryFileIndex extends Logging {
         val missingFiles = mutable.ArrayBuffer.empty[String]
         val filteredLeafStatuses = allLeafStatuses.filterNot(
           status => shouldFilterOut(status.getPath.getName))
    -    val resolvedLeafStatuses = filteredLeafStatuses.flatMap {
    +    val resolvedLeafStatuses = filteredLeafStatuses.par.flatMap {
    --- End diff --
    
    Thanks @maropu for your comments. I updated the title and description. 
Let me explain the difference between this change and the current parallel 
partition discovery. The current one discovers different partitions in 
parallel. This change gets the block locations for the files within a single 
partition in parallel. When there are only a few partitions and each contains 
tens of thousands of files, the current partition discovery won't help, but 
this change can speed up the listing in that case.
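
To make the distinction concrete, here is a minimal sketch of per-file lookup inside one partition. It is not the code from the PR: `FileEntry`, `LocatedEntry` and `lookupBlockLocations` are hypothetical stand-ins for the Hadoop file status types and the block location call, and it uses `scala.concurrent.Future` instead of the `.par` collection in the diff so that it also runs on Scala versions where parallel collections are a separate module.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockLocationSketch {
  // Hypothetical stand-ins for a file status and a located file status.
  final case class FileEntry(path: String)
  final case class LocatedEntry(path: String, hosts: Seq[String])

  // Hypothetical lookup. In a real listing this would be one remote call
  // per file; None models a file that was deleted during the listing.
  def lookupBlockLocations(f: FileEntry): Option[LocatedEntry] =
    if (f.path.endsWith(".missing")) None
    else Some(LocatedEntry(f.path, Seq("host-a")))

  // One lookup after another: the cost grows with the number of files
  // in the partition, no matter how few partitions there are.
  def resolveSequential(files: Seq[FileEntry]): Seq[LocatedEntry] =
    files.flatMap(f => lookupBlockLocations(f))

  // Lookups for the files of a single partition are issued concurrently.
  // Future.traverse keeps the input order; flatten drops missing files.
  def resolveParallel(files: Seq[FileEntry]): Seq[LocatedEntry] = {
    val all = Future.traverse(files)(f => Future(lookupBlockLocations(f)))
    Await.result(all, 30.seconds).flatten
  }
}
```

Both functions return the same result; only the second one overlaps the per-file calls, which is where a partition with tens of thousands of files would benefit.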


