[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-21656:
----------------------------------
    Description: 
Right now Spark lets go of executors when they have been idle for 60s (or a 
configurable time). I have seen Spark let them go while idle even though they 
were still needed. I have seen this happen when the scheduler was waiting to 
get node locality, which takes longer than the default idle timeout. In these 
jobs the number of executors drops very low (fewer than 10) while there are 
still on the order of 80,000 tasks to run.
We should consider not letting executors idle-timeout while they are still 
needed, based on the number of tasks left to run.

There are multiple reasons:
1. Unexpected things can happen in the system that cause delays, and Spark 
should be resilient to them. If the driver is GC'ing, there are network delays, 
etc., we could idle-timeout executors even though there are tasks to run on 
them; the scheduler just hasn't had time to start those tasks yet. This only 
slows down the user's job, which the user does not want.

2. Internal Spark components have opposing requirements. The scheduler needs 
to try for locality; dynamic allocation doesn't know about this, and by giving 
away executors it prevents the scheduler from doing what it was designed to do. 
(A sketch of one possible check is below.)
Ideally we have enough executors to run all the tasks on. If dynamic allocation 
lets those idle-timeout, the scheduler cannot make proper decisions. In the end 
this hurts users by affecting the job. A user should not have to mess with the 
configs to keep this basic behavior.

  was:
Right now Spark lets go of executors when they have been idle for 60s (or a 
configurable time). I have seen Spark let them go while idle even though they 
were still needed. I have seen this happen when the scheduler was waiting to 
get node locality, which takes longer than the default idle timeout. In these 
jobs the number of executors drops very low (fewer than 10) while there are 
still on the order of 80,000 tasks to run.
We should consider not letting executors idle-timeout while they are still 
needed, based on the number of tasks left to run.


> Spark dynamic allocation should not idle-timeout executors when there are 
> still tasks to run
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-21656
>                 URL: https://issues.apache.org/jira/browse/SPARK-21656
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now Spark lets go of executors when they have been idle for 60s (or a 
> configurable time). I have seen Spark let them go while idle even though they 
> were still needed. I have seen this happen when the scheduler was waiting to 
> get node locality, which takes longer than the default idle timeout. In 
> these jobs the number of executors drops very low (fewer than 10) while 
> there are still on the order of 80,000 tasks to run.
> We should consider not letting executors idle-timeout while they are still 
> needed, based on the number of tasks left to run.
> There are multiple reasons:
> 1. Unexpected things can happen in the system that cause delays, and Spark 
> should be resilient to them. If the driver is GC'ing, there are network 
> delays, etc., we could idle-timeout executors even though there are tasks to 
> run on them; the scheduler just hasn't had time to start those tasks yet. 
> This only slows down the user's job, which the user does not want.
> 2. Internal Spark components have opposing requirements. The scheduler needs 
> to try for locality; dynamic allocation doesn't know about this, and by 
> giving away executors it prevents the scheduler from doing what it was 
> designed to do.
> Ideally we have enough executors to run all the tasks on. If dynamic 
> allocation lets those idle-timeout, the scheduler cannot make proper 
> decisions. In the end this hurts users by affecting the job. A user should 
> not have to mess with the configs to keep this basic behavior.


