Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/9433#issuecomment-164622068
The reason the current delay scheduling algorithm doesn't work this way
is that it was designed for a scenario like the following: a bunch of jobs are
sharing the same workers, so job A can only run a small number of tasks at a
time. In this scenario, it doesn't make sense to look at the time since job
A was submitted, because there will be points where a lot of time has elapsed
just because the fairness policy dictates that job A can't run many tasks at
once. If job A has the opportunity to run a task on a non-local worker, it
probably doesn't want to "waste" one of the few slots it's allowed to
use at once on the non-local executor, instead preferring to wait
spark.locality.wait to see if a local executor becomes available (for more
about this, see the delay scheduling paper:
http://elmeleegy.com/khaled/papers/delay_scheduling.pdf).
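    To make the knob being discussed concrete, here is a minimal, illustrative Scala snippet showing how an application might raise spark.locality.wait (and the related per-level waits) so tasks hold out longer for a local executor. The values are arbitrary examples, not recommendations.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative settings only: raise the locality waits so a task set keeps
    // waiting for a local slot instead of launching on a non-local executor.
    val conf = new SparkConf()
      .setAppName("locality-wait-example")
      // Base wait applied at each locality level (process-local -> node-local -> rack-local).
      .set("spark.locality.wait", "10s")
      // The per-level waits can also be overridden individually.
      .set("spark.locality.wait.node", "10s")
      .set("spark.locality.wait.rack", "5s")

    val sc = new SparkContext(conf)
    ```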
This patch is addressing a different scenario, where the amount of
concurrency the job can use is mostly limited by the locality wait, rather than
by some fairness policy. I think the ideal way to fix this problem is with the
following policy: if the task set is using less than the number of slots it
could be using (where "# slots it could be using" is all of the slots in
the cluster if the job is running alone, or the job's fair share, if it's
not) for some period of time, increase the locality level. The current delay
scheduling policy used by Spark is essentially implementing a very simplified
version of this ideal policy, where the way it determines if the job is using
as many slots as it could be is just to see if a task has been launched
recently. This patch adds another heuristic to get closer to this ideal policy.
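    As a rough illustration of that ideal policy, the decision could look something like the sketch below. This is a standalone example with made-up names (SlotUsage, shouldRelaxLocality, entitledSlots), not the actual TaskSetManager logic in Spark or in this patch.

    ```scala
    // Hypothetical sketch of the "ideal" policy described above: only relax the
    // locality level once the task set has been running fewer tasks than it is
    // entitled to (all slots in the cluster, or its fair share) for longer than
    // the configured wait.
    case class SlotUsage(
        runningTasks: Int,                      // tasks currently running for this task set
        entitledSlots: Int,                     // slots the job could be using right now
        belowEntitlementSinceMs: Option[Long])  // when usage first dropped below entitlement

    def shouldRelaxLocality(usage: SlotUsage, nowMs: Long, localityWaitMs: Long): Boolean = {
      if (usage.runningTasks >= usage.entitledSlots) {
        // The job is using every slot it is entitled to, so there is no reason to
        // give up locality; keep waiting for local executors.
        false
      } else {
        // Under-utilized: relax only after being below entitlement for the full wait.
        usage.belowEntitlementSinceMs.exists(since => nowMs - since >= localityWaitMs)
      }
    }
    ```

    The hard part is computing the entitlement correctly under a fairness policy, which is why the existing code approximates it by just checking whether a task was launched recently.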
I'm hesitant to merge this patch for a few reasons:
(1) It makes the scheduling policy harder to reason about, so while it will
help performance for some people who run into this case and understand how to
tune the configuration parameters, it is likely to confuse others (and possibly
introduce scheduling bugs as we try to maintain this functionality).
(2) It also doesn't implement the ideal policy, as described above.
(Although implementing the ideal policy would likely be prohibitive, especially
because we would need to do it in a backwards-compatible way.)
On the other hand, the use case this is addressing seems likely to be
fairly common, since many folks run Spark jobs in a context where only one
Spark job is running on a set of workers at a time.
@mateiz what are your thoughts on this?
As an aside, can you clarify the description of this PR to explicitly say
that you're introducing a new set of locality timers, based on the time since
the task set was submitted?