[ https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Condit updated YUNIKORN-2929:
-----------------------------------
Target Version: (was: 1.7.0)
> Implement Skip Allocation Check for Unsuccessful Pods
> ------------------------------------------------------
>
> Key: YUNIKORN-2929
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2929
> Project: Apache YuniKorn
> Issue Type: Task
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
>
> Skip allocation attempts for subsequent pods in an application if previous
> pods have failed to allocate.
> When running Spark applications, if an executor pod fails to find a suitable
> node, subsequent executor pods are likely to fail as well, since executors of
> the same application typically share the same resource requests and
> scheduling constraints. This is particularly problematic when the application
> has a toleration for a specific taint and only a few nodes carry that taint.
> The scheduler spends excessive time attempting to allocate these pods, yet
> ultimately none of them are bound to nodes.
> To optimize scheduling, we should:
> # Implement a check to determine whether previous pods in the same
> application were successfully allocated.
> # Skip processing the remaining pods in the application if previous pods
> failed to allocate.
> # Generalize this by:
> ** Taking immediate action for Spark applications (skipping after the first
> failed pod).
> ** Introducing a threshold of 'n' failed pods, after which the scheduler
> stops trying to allocate the application's pods and restarts the scheduling
> cycle.
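The proposed skip check can be sketched in Go as follows. This is not YuniKorn's scheduler code: Ask, SkipPolicy, scheduleCycle and the tryAllocate callback are hypothetical names chosen for illustration. The sketch assumes the threshold counts failed pods per application within a single scheduling cycle, and models the Spark "immediate action" as a per-type threshold of 1.

```go
package main

import "fmt"

// Ask is a hypothetical, simplified stand-in for a pending pod allocation request.
type Ask struct {
	PodName string
	AppID   string
	AppType string // hypothetical field, e.g. "spark"
}

// SkipPolicy is a hypothetical policy: how many failed pods an application may
// accumulate in one cycle before its remaining pods are skipped.
type SkipPolicy struct {
	DefaultThreshold int            // the 'n' from the proposal; 0 disables skipping
	PerTypeThreshold map[string]int // e.g. {"spark": 1} for immediate action
}

// threshold returns the per-type override if present, else the default.
func (p SkipPolicy) threshold(appType string) int {
	if t, ok := p.PerTypeThreshold[appType]; ok {
		return t
	}
	return p.DefaultThreshold
}

// scheduleCycle tries each ask at most once. tryAllocate is a hypothetical
// callback reporting whether a node was found. Failure counts live only for
// this cycle, so skipped pods are retried when the next cycle starts.
func scheduleCycle(asks []Ask, policy SkipPolicy, tryAllocate func(Ask) bool) (allocated, skipped []string) {
	failures := make(map[string]int) // failed attempts per application in this cycle
	for _, ask := range asks {
		limit := policy.threshold(ask.AppType)
		if limit > 0 && failures[ask.AppID] >= limit {
			// Threshold reached: skip without attempting an allocation.
			skipped = append(skipped, ask.PodName)
			continue
		}
		if tryAllocate(ask) {
			allocated = append(allocated, ask.PodName)
		} else {
			failures[ask.AppID]++
		}
	}
	return allocated, skipped
}

func main() {
	asks := []Ask{
		{"spark-exec-1", "app-spark", "spark"},
		{"spark-exec-2", "app-spark", "spark"},
		{"spark-exec-3", "app-spark", "spark"},
		{"batch-1", "app-batch", "batch"},
		{"batch-2", "app-batch", "batch"},
		{"batch-3", "app-batch", "batch"},
		{"batch-4", "app-batch", "batch"},
	}
	policy := SkipPolicy{DefaultThreshold: 2, PerTypeThreshold: map[string]int{"spark": 1}}
	// Simulated cluster: only batch-1 fits; no node suits the Spark executors.
	fits := func(a Ask) bool { return a.PodName == "batch-1" }
	allocated, skipped := scheduleCycle(asks, policy, fits)
	fmt.Println("allocated:", allocated) // allocated: [batch-1]
	fmt.Println("skipped:", skipped)     // skipped: [spark-exec-2 spark-exec-3 batch-4]
}
```

Here the Spark application stops after its first failed executor, while the batch application is allowed two failures before its last pod is skipped. Counting total rather than consecutive failures is a design choice of this sketch; the proposal does not specify which.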