GitHub user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/9433#issuecomment-164622068
  
    The reason the current delay scheduling algorithm doesn’t work this way is that it was designed with a scenario like this in mind: a bunch of jobs are sharing the same workers, so job A can only run a small number of tasks at a time.  In this scenario, it doesn’t make sense to look at the time since job A was submitted, because there will be points where a lot of time has elapsed simply because the fairness policy dictates that job A can’t run many tasks at once.  If job A has the opportunity to run a task on a non-local worker, it probably doesn’t want to “waste” one of the few slots it’s allowed to use at once on a non-local executor, instead preferring to wait spark.locality.wait to see if a local executor becomes available (for more about this, see the delay scheduling paper: http://elmeleegy.com/khaled/papers/delay_scheduling.pdf).
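
    To make that concrete, here’s a minimal sketch of the existing heuristic (a simplification for discussion, not the actual TaskSetManager code): the allowed locality level is only relaxed once no task has been launched within spark.locality.wait.

```scala
// Simplified sketch of the current delay scheduling heuristic (not the actual
// TaskSetManager code): "a task was launched recently" is used as a proxy for
// "the job is using all of the slots it can", so the allowed locality level is
// only relaxed after spark.locality.wait passes with no launches.
object DelaySchedulingSketch {
  // Locality levels, from most to least local.
  private val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
  private val localityWaitMs = 3000L  // spark.locality.wait, 3s by default

  private var currentLevelIndex = 0   // start at the most local level
  private var lastTaskLaunchTime = System.currentTimeMillis()

  /** Called whenever the scheduler offers a slot to this task set. */
  def allowedLocalityLevel(now: Long): String = {
    // If more than the locality wait has passed since the last launch, give up
    // on the current level and allow the next, less local one.
    while (currentLevelIndex < levels.length - 1 &&
           now - lastTaskLaunchTime > localityWaitMs) {
      lastTaskLaunchTime += localityWaitMs
      currentLevelIndex += 1
    }
    levels(currentLevelIndex)
  }

  /** Called when a task is actually launched: reset the timer. */
  def recordTaskLaunch(now: Long): Unit = {
    lastTaskLaunchTime = now
  }
}
```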
    
    This patch is addressing a different scenario, where the amount of concurrency the job can use is mostly limited by the locality wait rather than by some fairness policy.  I think the ideal way to fix this problem is with the following policy: if the task set has been using fewer slots than it could be using (where “# slots it could be using” is all of the slots in the cluster if the job is running alone, or the job’s fair share if it’s not) for some period of time, increase the locality level.  The current delay scheduling policy used by Spark essentially implements a very simplified version of this ideal policy: it determines whether the job is using as many slots as it could simply by checking whether a task has been launched recently.  This patch adds another heuristic to get closer to that ideal policy.
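
    A rough, hypothetical sketch of what that ideal policy could look like (names such as `slotsEntitled` are illustrative, not existing Spark APIs): only relax the locality level once the task set has been running fewer tasks than its entitled share for longer than the locality wait.

```scala
// Hypothetical sketch of the "ideal" policy described above, not a proposal
// for the actual implementation. `slotsEntitled` would be all of the slots in
// the cluster if the job is running alone, or the job's fair share otherwise.
class IdealDelaySchedulingSketch(localityWaitMs: Long) {
  private val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
  private var levelIndex = 0
  // When the task set first dropped below its entitled share, if ever.
  private var underUtilizedSince: Option[Long] = None

  def allowedLocalityLevel(now: Long, runningTasks: Int, slotsEntitled: Int): String = {
    if (runningTasks >= slotsEntitled) {
      // Using every slot it is entitled to: no reason to give up locality.
      underUtilizedSince = None
    } else {
      val since = underUtilizedSince.getOrElse {
        underUtilizedSince = Some(now)
        now
      }
      if (now - since > localityWaitMs && levelIndex < levels.length - 1) {
        // Persistently under-utilized despite free capacity: accept a less
        // local level, and restart the timer at the new level.
        levelIndex += 1
        underUtilizedSince = Some(now)
      }
    }
    levels(levelIndex)
  }
}
```

    Under a policy like that, a job running alone would keep its tasks local as long as it is saturating the cluster, and would only start accepting non-local slots once it has been under-utilized for the wait period.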
    
    I’m hesitant to merge this patch for two reasons:
    (1) It makes the scheduling policy harder to reason about, so while it will 
help performance for some people who run into this case and understand how to 
tune the configuration parameters, it is likely to confuse others (and possibly 
introduce scheduling bugs as we try to maintain this functionality).
    (2) It also doesn’t implement the ideal policy described above. (Although implementing the ideal policy exactly would likely be prohibitively complex, especially because we would need to do it in a backwards-compatible way.)
    
    On the other hand, the use case this is addressing seems likely to be 
fairly common, since many folks run Spark jobs in a context where only one 
Spark job is running on a set of workers at a time.
    
    @mateiz what are your thoughts on this?
    
    As an aside, can you clarify the description of this PR to explicitly say 
that you’re introducing a new set of locality timers, based on the time since 
the task set was submitted?

