squito commented on a change in pull request #23951: [SPARK-27038][CORE][YARN] Re-implement RackResolver to reduce resolving time
URL: https://github.com/apache/spark/pull/23951#discussion_r263450412
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 ##########
 @@ -184,11 +184,23 @@ private[spark] class TaskSetManager(
     t.epoch = epoch
   }
 
 +  // A buffer of (preferred location, task index) pairs
 +  private val locationWithTaskIndex: ArrayBuffer[(String, Int)] = new ArrayBuffer[(String, Int)]()
+  private val addTaskStartTime = System.nanoTime()
   // Add all our tasks to the pending lists. We do this in reverse order
   // of task index so that tasks with low indices get launched first.
   for (i <- (0 until numTasks).reverse) {
-    addPendingTask(i)
+    addPendingTask(i, true)
   }
 +  // Convert the preferred-location list to a rack list in one invocation and zip with the original indices
 +  private val rackWithTaskIndex =
 +    sched.getRacksForHosts(locationWithTaskIndex.map(_._1).toList)
 
 Review comment:
   Would it be worth doing some de-duping here? It would not be unusual to have a taskset with 10K tasks on 100 hosts. I see that `CachedDNSToSwitchMapping` will cache the lookup itself, but it doesn't seem necessary to create those intermediate data structures with 10K elements.
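
   The de-duping being suggested could be sketched roughly as below. This is only an illustration, not the PR's actual code: `getRacksForHosts` here is a hypothetical stand-in for the scheduler's resolver, and `resolveRacks` is an invented helper name.

   ```scala
   object RackDedupSketch {
     // Hypothetical stand-in for the scheduler's rack resolver:
     // maps each host name to an optional rack path.
     def getRacksForHosts(hosts: Seq[String]): Seq[Option[String]] =
       hosts.map(h => Some("/rack-" + h.hashCode.abs % 100))

     // Resolve each distinct host once, then fan the results back out to the
     // full (host, taskIndex) list -- so 10K tasks on 100 hosts trigger only
     // 100 lookups, and the intermediate collections stay small.
     def resolveRacks(locationWithTaskIndex: Seq[(String, Int)]): Seq[(Option[String], Int)] = {
       val distinctHosts = locationWithTaskIndex.map(_._1).distinct
       val hostToRack = distinctHosts.zip(getRacksForHosts(distinctHosts)).toMap
       locationWithTaskIndex.map { case (host, idx) => (hostToRack(host), idx) }
     }
   }
   ```

   The point of the fan-out via the `hostToRack` map is that the cost of resolution scales with the number of distinct hosts rather than the number of tasks, independent of whatever caching `CachedDNSToSwitchMapping` does underneath.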

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]

