[ 
https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YUNIKORN-2929:
--------------------------------
    Description: 
 

When running Spark applications, if an executor pod fails to find a suitable 
node, subsequent executor pods are likely to fail as well. This is 
particularly problematic when the application tolerates a specific taint and 
only a limited number of nodes carry that taint. The scheduler then spends 
excessive time attempting allocations that cannot succeed, ultimately binding 
no pods to nodes.

To optimize scheduling, we should:
 # Implement a check to determine whether previous pods in the same application 
were successfully allocated.
 # Skip processing the remaining pods in the application if previous pods 
failed to allocate.
 # Generalize this by:
 ** Adding an immediate skip action for Spark applications.
 ** Introducing a threshold of 'n' pods, after which the scheduler stops 
trying and restarts the scheduling cycle.



> Skip allocation attempts for subsequent pods in an application if previous 
> pods have failed to allocate
> -------------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2929
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2929
>             Project: Apache YuniKorn
>          Issue Type: Task
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
