Mit Desai created YUNIKORN-2929:
-----------------------------------
Summary: Skip allocation attempts for subsequent pods in an
application if previous pods have failed to allocate
Key: YUNIKORN-2929
URL: https://issues.apache.org/jira/browse/YUNIKORN-2929
Project: Apache YuniKorn
Issue Type: Task
Reporter: Mit Desai
Assignee: Mit Desai
When running Spark applications, if one executor pod fails to find a suitable
node, subsequent executor pods in the same application are likely to fail as
well. This is particularly problematic when the application has a toleration
for a specific taint and only a limited number of nodes carry that taint. The
scheduler then spends excessive time on allocation attempts that end with no
pods being bound to nodes.
To optimize scheduling, we should:
# Implement a check to determine if previous pods in the same application were
successfully allocated.
# Skip processing other pods in the application if previous pods failed to
allocate.
# Generalize this by:
** Adding an immediate action for Spark applications.
** Introducing a threshold ('n' number of pods) after which the scheduler will
stop trying and restart the scheduling cycle.
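
The proposal above could be sketched roughly as follows. This is an illustration only, not YuniKorn code: the types Ask and App, the callback TryNodeFunc, and the function AllocateApp are hypothetical names invented for this example. It also assumes that the threshold 'n' counts consecutive failed pods within one scheduling cycle, which the issue text does not specify.

```go
package main

// Ask is a hypothetical stand-in for one pending pod request.
type Ask struct {
	Name string
}

// App is a hypothetical stand-in for an application whose pending
// pods are kept in submission order.
type App struct {
	ID      string
	Pending []Ask
}

// TryNodeFunc is a hypothetical callback that reports whether a
// suitable node was found for the given ask.
type TryNodeFunc func(ask Ask) bool

// AllocateApp tries the pending asks of one application in order.
// Once maxFailures consecutive asks have failed, it stops and reports
// how many asks were skipped for this cycle (assumption: the threshold
// counts consecutive failures). maxFailures <= 0 disables the check,
// and maxFailures == 1 models an "immediate" stop on the first failure.
func AllocateApp(app App, maxFailures int, try TryNodeFunc) (allocated []string, skipped int) {
	failures := 0
	for i, ask := range app.Pending {
		if maxFailures > 0 && failures >= maxFailures {
			// Threshold reached: leave the remaining asks for the next cycle.
			skipped = len(app.Pending) - i
			break
		}
		if try(ask) {
			allocated = append(allocated, ask.Name)
			failures = 0 // a success resets the consecutive-failure count
			continue
		}
		failures++
	}
	return allocated, skipped
}
```

With five pending executors, a threshold of 2, and no node available, this sketch makes two attempts and skips the remaining three pods instead of trying all five. Whether a success should reset the counter, and whether the threshold should be configurable per application type, are design choices the issue leaves open.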