[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves reassigned SPARK-21656: ------------------------------------- Assignee: Jong Yoon Lee > spark dynamic allocation should not idle timeout executors when there are > enough tasks to run on them > ----------------------------------------------------------------------------------------------------- > > Key: SPARK-21656 > URL: https://issues.apache.org/jira/browse/SPARK-21656 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Jong Yoon Lee > Assignee: Jong Yoon Lee > Fix For: 2.2.1, 2.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Right now with dynamic allocation spark starts by getting the number of > executors it needs to run all the tasks in parallel (or the configured > maximum) for that stage. After it gets that number it will never reacquire > more unless either an executor dies, is explicitly killed by yarn or it goes > to the next stage. The dynamic allocation manager has the concept of idle > timeout. Currently this says if a task hasn't been scheduled on that executor > for a configurable amount of time (60 seconds by default), then let that > executor go. Note when it lets that executor go due to the idle timeout it > never goes back to see if it should reacquire more. > This is a problem for multiple reasons: > 1 . Things can happen in the system that are not expected that can cause > delays. Spark should be resilient to these. If the driver is GC'ing, you have > network delays, etc we could idle timeout executors even though there are > tasks to run on them its just the scheduler hasn't had time to start those > tasks. Note that in the worst case this allows the number of executors to go > to 0 and we have a deadlock. > 2. Internal Spark components have opposing requirements. The scheduler has a > requirement to try to get locality, the dynamic allocation doesn't know about > this and if it lets the executors go it hurts the scheduler from doing what > it was designed to do. For example the scheduler first tries to schedule > node local, during this time it can skip scheduling on some executors. After > a while though the scheduler falls back from node local to scheduler on rack > local, and then eventually on any node. So during when the scheduler is > doing node local scheduling, the other executors can idle timeout. This > means that when the scheduler does fall back to rack or any locality where it > would have used those executors, we have already let them go and it can't > scheduler all the tasks it could which can have a huge negative impact on job > run time. > > In both of these cases when the executors idle timeout we never go back to > check to see if we need more executors (until the next stage starts). In the > worst case you end up with 0 and deadlock, but generally this shows itself by > just going down to very few executors when you could have 10's of thousands > of tasks to run on them, which causes the job to take way more time (in my > case I've seen it should take minutes and it takes hours due to only been > left a few executors). > We should handle these situations in Spark. The most straight forward > approach would be to not allow the executors to idle timeout when there are > tasks that could run on those executors. This would allow the scheduler to do > its job with locality scheduling. In doing this it also fixes number 1 above > because you never can go into a deadlock as it will keep enough executors to > run all the tasks on. > There are other approaches to fix this, like explicitly prevent it from going > to 0 executors, that prevents a deadlock but can still cause the job to > slowdown greatly. We could also change it at some point to just re-check to > see if we should get more executors, but this adds extra logic, we would have > to decide when to check, its also just overhead in letting them go and then > re-acquiring them again and this would cause some slowdown in the job as the > executors aren't immediately there for the scheduler to place things on. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org