Craig Condit created YUNIKORN-2099:
--------------------------------------
Summary: [Umbrella] State initialisation simplification (phase 2)
Key: YUNIKORN-2099
URL: https://issues.apache.org/jira/browse/YUNIKORN-2099
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - scheduler, shim - kubernetes
Reporter: Craig Condit
Assignee: Craig Condit
On startup the scheduler rebuilds the full state of the cluster. This is
currently called recovery. The name is misleading: nothing is recovered, the
current state is simply loaded. State initialisation is a better term to use.
The current recovery code links the loading of applications and tasks (pods) to
node loading. This makes the recovery code complex and thus fragile. It could,
in a worst-case scenario, lead to a pod not being recovered correctly.
Recovery should be a step-by-step process with clearly bounded steps:
* load nodes
** register nodes with the core
* load pods
** create applications in core
** register running pods as allocations with the core
** register pending pods as asks with the core
* process changes for nodes and pods
* start scheduling
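
The steps above can be sketched as a strictly ordered sequence of phases. This is a minimal illustration only, not the actual YuniKorn code: the Pod and Core types, the InitState function and the list of buffered changes are all hypothetical names invented for this sketch, and the core is modelled as plain in-memory maps.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	AppID string
	Node  string // empty when the pod is still pending
}

// Core is a hypothetical in-memory stand-in for the scheduler core.
type Core struct {
	Nodes       map[string]bool
	Apps        map[string]bool
	Allocations map[string]string // pod name -> node name
	Asks        map[string]bool   // pending pod names
	Scheduling  bool
}

// NewCore returns an empty core with all maps initialised.
func NewCore() *Core {
	return &Core{
		Nodes:       map[string]bool{},
		Apps:        map[string]bool{},
		Allocations: map[string]string{},
		Asks:        map[string]bool{},
	}
}

// InitState runs the phases in order. Each phase completes before the next
// one starts, so pod loading never depends on the progress of node loading.
func InitState(core *Core, nodes []string, pods []Pod, changes []func(*Core)) {
	// Phase 1: load nodes and register them with the core.
	for _, n := range nodes {
		core.Nodes[n] = true
	}
	// Phase 2: load pods, creating applications as they are first seen.
	for _, p := range pods {
		core.Apps[p.AppID] = true
		if p.Node != "" {
			core.Allocations[p.Name] = p.Node // running pod -> allocation
		} else {
			core.Asks[p.Name] = true // pending pod -> ask
		}
	}
	// Phase 3: process node and pod changes buffered during loading.
	for _, change := range changes {
		change(core)
	}
	// Phase 4: only now start scheduling.
	core.Scheduling = true
}

func main() {
	core := NewCore()
	pods := []Pod{
		{Name: "pod-a", AppID: "app-1", Node: "node-1"},
		{Name: "pod-b", AppID: "app-1"},
	}
	InitState(core, []string{"node-1"}, pods, nil)
	fmt.Println(len(core.Nodes), len(core.Apps), len(core.Allocations), len(core.Asks), core.Scheduling)
}
```

Because every phase is a separate loop, a failure or a slow node in one phase cannot leave a pod half loaded in another, which is the boundary the list above asks for.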
No nodes, applications, or asks on existing applications should be rejected.
Even if the queue does not exist, a running application must be added and
handled. The current behaviour of rejecting an application that cannot be
placed in a queue is incorrect.
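
One possible way to honour the no-rejection rule is sketched below. This issue does not prescribe a mechanism, so everything here is an assumption made for illustration: the Scheduler type, the AddRecoveredApp method and the fallback queue name are hypothetical and are not taken from the YuniKorn code base.

```go
package main

import "fmt"

// fallbackQueue is a hypothetical queue name used only in this sketch.
const fallbackQueue = "root.fallback"

// Scheduler is a hypothetical, simplified queue registry.
type Scheduler struct {
	Queues map[string][]string // queue name -> application IDs
}

// AddRecoveredApp never rejects an application. If the requested queue does
// not exist, the application is placed in the fallback queue instead.
// It returns the name of the queue the application ended up in.
func (s *Scheduler) AddRecoveredApp(appID, queue string) string {
	if _, ok := s.Queues[queue]; !ok {
		queue = fallbackQueue
	}
	s.Queues[queue] = append(s.Queues[queue], appID)
	return queue
}

func main() {
	s := &Scheduler{Queues: map[string][]string{"root.default": nil}}
	fmt.Println(s.AddRecoveredApp("app-1", "root.default"))
	fmt.Println(s.AddRecoveredApp("app-2", "root.removed"))
}
```

The point of the sketch is only that the add path has no error return: an application that is already running in the cluster always ends up tracked somewhere.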