GitHub user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/16376#discussion_r94835320
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -159,7 +159,12 @@ private[spark] class TaskSetManager(
       addPendingTask(i)
     }

-  // Figure out which locality levels we have in our TaskSet, so we can do delay scheduling
+  /**
+   * Track the set of locality levels which are valid, given the tasks' locality preferences and
+   * the set of currently available executors. This is updated as executors are added and removed.
+   * This allows a performance optimization: skipping levels that aren't relevant (e.g., skip
+   * PROCESS_LOCAL if no tasks could be run PROCESS_LOCAL for the current set of executors).
+   */
   var myLocalityLevels = computeValidLocalityLevels()
--- End diff ---
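For context on what this comment describes, here is a minimal, self-contained sketch of the kind of check `computeValidLocalityLevels()` performs. The names, types, and map shapes below are simplified stand-ins for the real `TaskSetManager` state, not the actual implementation:

```scala
// Illustrative sketch only: simplified stand-ins for the real TaskSetManager
// state, which keys pending tasks by executor, host, and rack.
object LocalityLevelSketch {
  object TaskLocality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY = Value
  }
  import TaskLocality._

  // A level is "valid" only if some pending task could actually run at it
  // given the executors and hosts that are currently alive.
  def computeValidLevels(
      pendingTasksForExecutor: Map[String, Seq[Int]], // executorId -> task indices
      pendingTasksForHost: Map[String, Seq[Int]],     // host -> task indices
      aliveExecutors: Set[String],
      aliveHosts: Set[String]): Seq[TaskLocality.Value] = {
    val levels = Seq.newBuilder[TaskLocality.Value]
    // Both checks scan state keyed by executor/host: O(numExecutors)-ish work.
    if (pendingTasksForExecutor.keysIterator.exists(aliveExecutors.contains)) {
      levels += PROCESS_LOCAL // some task could run in a live executor's JVM
    }
    if (pendingTasksForHost.keysIterator.exists(aliveHosts.contains)) {
      levels += NODE_LOCAL // some task could run on a currently live host
    }
    levels += ANY // ANY is always valid, so the task set is never starved
    levels.result()
  }
}
```

The payoff is that delay scheduling never waits at a level no task can satisfy: if no pending task prefers a live executor, PROCESS_LOCAL is dropped from the list entirely.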
While figuring out the purpose of this code so I could write the comment, I made a couple of observations:
1) For each executor we add or remove, it's an O(numExecutors) operation to update the locality levels, so adding a bunch of executors is O(numExecutors^2) overall (see the toy cost model after this list). That's minor on small clusters, but I wonder if it becomes an issue when you're using dynamic allocation and scaling up and down to 1000s of executors. It's all happening while holding a lock on the `TaskSchedulerImpl`, too.
2) Though we recompute the valid locality levels as executors come and go, we do *not* recompute them as tasks complete. That's not a problem -- as offers come in, we still go through the right task lists. But it does make me wonder whether this business of updating the locality levels for the current set of executors is useful at all, or whether we should just always use all levels.
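To put numbers on observation 1, here is a toy cost model (all names hypothetical; `recomputeLevels` stands in for `computeValidLocalityLevels()`): if each executor added triggers a recompute that scans every live executor, N additions cost 1 + 2 + ... + N = N(N+1)/2 scans in total.

```scala
// Toy cost model, not Spark code: each executor added triggers a recompute
// that touches every live executor.
object LocalityRecomputeCost extends App {
  var live = Set.empty[String]
  var totalScans = 0L

  // Stand-in for computeValidLocalityLevels(): O(numExecutors) work per call.
  def recomputeLevels(executors: Set[String]): Unit = {
    totalScans += executors.size
  }

  for (i <- 1 to 1000) {
    live += s"exec-$i"    // an executor joins, e.g. under dynamic allocation
    recomputeLevels(live) // levels recomputed while holding the scheduler lock
  }
  println(s"scans across 1000 adds: $totalScans") // 500500, i.e. O(N^2)
}
```

At thousands of executors that's millions of scans, all serialized behind the `TaskSchedulerImpl` lock, which is part of why the alternative in observation 2 (always keep all levels and let the offer loop walk the right task lists) might be worth considering.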