[
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295282#comment-16295282
]
Nan Zhu commented on SPARK-21656:
---------------------------------
NOTE: the issue is fixed by https://github.com/apache/spark/pull/18874
> spark dynamic allocation should not idle timeout executors when there are
> enough tasks to run on them
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: Jong Yoon Lee
> Assignee: Jong Yoon Lee
> Fix For: 2.2.1, 2.3.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Right now, with dynamic allocation, Spark starts by acquiring the number of
> executors it needs to run all the tasks of a stage in parallel (or the
> configured maximum). After it reaches that number it never reacquires more
> unless an executor dies, is explicitly killed by YARN, or the job moves to
> the next stage. The dynamic allocation manager also has the concept of an
> idle timeout: if no task has been scheduled on an executor for a
> configurable amount of time (60 seconds by default), that executor is
> released. Note that when an executor is released due to the idle timeout,
> the manager never goes back to check whether it should reacquire more.
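> For reference, the behaviour above is controlled by the standard dynamic
> allocation settings; a minimal sketch, assuming a YARN deployment with the
> external shuffle service (the values shown are only illustrative):
> {code:scala}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   // turn on dynamic allocation (needs the external shuffle service on YARN)
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.shuffle.service.enabled", "true")
>   // release an executor after it has been idle for this long (default 60s)
>   .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
>   // lower and upper bounds on the number of executors
>   .set("spark.dynamicAllocation.minExecutors", "0")
>   .set("spark.dynamicAllocation.maxExecutors", "1000")
> {code}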
> This is a problem for multiple reasons:
> 1. Unexpected things can happen in the system that cause delays, and Spark
> should be resilient to them. If the driver is GC'ing, there are network
> delays, etc., we can idle-timeout executors even though there are tasks to
> run on them; the scheduler simply hasn't had time to start those tasks yet.
> Note that in the worst case this allows the number of executors to go to 0,
> and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler tries
> to get locality, but dynamic allocation doesn't know about this, and if it
> lets executors go it prevents the scheduler from doing what it was designed
> to do. For example, the scheduler first tries to schedule tasks node-local,
> and during this time it can skip scheduling on some executors. After a while
> the scheduler falls back from node-local to rack-local scheduling, and then
> eventually to any node. So while the scheduler is doing node-local
> scheduling, the other executors can idle-timeout. This means that when the
> scheduler does fall back to rack or any locality, where it would have used
> those executors, we have already let them go and it can't schedule all the
> tasks it could, which can have a huge negative impact on job run time (see
> the configuration sketch after this list).
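> The timing that drives point 2 comes from the locality wait settings; a
> minimal sketch, assuming the standard configuration keys (the values are
> only illustrative):
> {code:scala}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   // how long the scheduler waits at each locality level before falling back
>   // (process-local -> node-local -> rack-local -> any)
>   .set("spark.locality.wait", "3s")
>   .set("spark.locality.wait.node", "3s")
>   .set("spark.locality.wait.rack", "3s")
>   // executors skipped for locality reasons across many task waves can sit
>   // idle long enough to hit this timeout before the fallback would use them
>   .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
> {code}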
>
> In both of these cases, when the executors idle-timeout we never go back to
> check whether we need more executors (until the next stage starts). In the
> worst case you end up with 0 executors and a deadlock, but more commonly it
> shows up as dropping down to very few executors when there are tens of
> thousands of tasks to run on them, which makes the job take far longer (in
> my case I've seen a job that should take minutes take hours because it was
> left with only a few executors).
> We should handle these situations in Spark. The most straightforward
> approach would be to not let executors idle-timeout while there are still
> tasks that could run on them. This lets the scheduler do its job with
> locality scheduling. It also fixes number 1 above, because you can never get
> into a deadlock: enough executors are kept to run all the outstanding tasks.
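> A minimal sketch of that rule (this is not the actual change in the PR
> above; the function and parameter names are hypothetical):
> {code:scala}
> // Only release an idle executor if the remaining executors can still run
> // every outstanding task in parallel.
> def shouldRemoveIdleExecutor(currentExecutors: Int,
>                              pendingTasks: Int,
>                              runningTasks: Int,
>                              tasksPerExecutor: Int): Boolean = {
>   // executors needed to run all outstanding tasks at once
>   val needed =
>     math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
>   currentExecutors > needed
> }
>
> // e.g. 100 executors, 5000 pending + 400 running tasks, 4 slots per executor:
> // needed = 1350, currentExecutors (100) is not greater, so the idle executor
> // is kept and the executor count can never fall below what the tasks need.
> {code}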
> There are other approaches to fix this, like explicitly preventing the count
> from going to 0 executors (a floor can already be set today, as sketched
> below); that avoids the deadlock but can still slow the job down greatly. We
> could also change it at some point to re-check whether we should get more
> executors, but this adds extra logic and we would have to decide when to
> check; it is also pure overhead to let executors go and then re-acquire
> them, and it would still slow the job down because the executors aren't
> immediately there for the scheduler to place tasks on.
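> For the "don't go to 0" alternative, the existing knob is the minimum bound
> on executors; a one-line sketch (the value is only illustrative):
> {code:scala}
> // keep at least 10 executors alive even when everything is idle
> conf.set("spark.dynamicAllocation.minExecutors", "10")
> {code}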