Craig Condit created YUNIKORN-2099:
--------------------------------------

             Summary: [Umbrella] State initialisation simplification (phase 2)
                 Key: YUNIKORN-2099
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2099
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - scheduler, shim - kubernetes
            Reporter: Craig Condit
            Assignee: Craig Condit


At startup the scheduler rebuilds the full state of the cluster. This is 
currently called recovery. The name is misleading: nothing is being recovered, 
the current state is simply being loaded. State initialisation is a better term.

The current recovery code ties the loading of applications and tasks (pods) to 
node loading. This makes the recovery code complex and thus fragile. In the 
worst case, it could lead to a pod not being recovered correctly.

Recovery should be a step-by-step process with clear boundaries between steps:
 * load nodes
 ** register nodes with the core
 * load pods
 ** create applications in core
 ** register running pods as allocations with the core
 ** register pending pods as asks with the core
 * process changes for nodes and pods
 * start scheduling
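
The ordering above can be sketched in Go as follows. This is only an illustration of the step boundaries, not the actual YuniKorn code: every name here (Core, Pod, InitialiseState and the fields on them) is hypothetical, and the "process changes" step is left as a comment because the issue does not describe how it works.

```go
package main

import "fmt"

// All types and functions below are hypothetical and exist only to
// illustrate the ordering of the steps; they are not the YuniKorn API.

// Pod is a minimal stand-in for a Kubernetes pod owned by an application.
type Pod struct {
	Name  string
	AppID string
	Node  string // empty while the pod is still pending
}

// Core is a minimal stand-in for the scheduler core state.
type Core struct {
	Nodes       map[string]bool
	Apps        map[string]bool
	Allocations []string // running pods
	Asks        []string // pending pods
	Scheduling  bool
}

// NewCore returns an empty core with initialised maps.
func NewCore() *Core {
	return &Core{Nodes: map[string]bool{}, Apps: map[string]bool{}}
}

// InitialiseState runs the steps strictly in order. Node loading finishes
// before any pod is looked at, so pod handling never depends on the
// progress of node handling.
func InitialiseState(core *Core, nodes []string, pods []Pod) {
	// Step 1: load nodes and register them with the core.
	for _, n := range nodes {
		core.Nodes[n] = true
	}

	// Step 2: load pods. Create the application, then register the pod
	// as an allocation (running) or as an ask (pending).
	for _, p := range pods {
		core.Apps[p.AppID] = true
		if p.Node != "" {
			core.Allocations = append(core.Allocations, p.Name)
		} else {
			core.Asks = append(core.Asks, p.Name)
		}
	}

	// Step 3: process node and pod changes that arrived during loading
	// (omitted in this sketch).

	// Step 4: only now start scheduling.
	core.Scheduling = true
}

func main() {
	core := NewCore()
	InitialiseState(core, []string{"node-1"}, []Pod{
		{Name: "pod-a", AppID: "app-1", Node: "node-1"},
		{Name: "pod-b", AppID: "app-1"},
	})
	fmt.Println(len(core.Nodes), len(core.Apps), core.Allocations, core.Asks, core.Scheduling)
}
```

The point of the sketch is that scheduling is enabled only after every earlier step has completed, and that no step can reject an item.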

No nodes, applications or asks on existing apps should be rejected. Even if the 
queue does not exist, a running application must be added and handled. The 
current rejection of an application that cannot be placed in a queue is 
incorrect behaviour.
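
As a rough illustration of the "never reject" rule, the Go sketch below accepts an application during initialisation even when its queue is unknown and merely records that fact. The Registry and App types, the QueueResolved flag and the queue names are all hypothetical; the issue does not say how such an application should be handled afterwards, so the sketch deliberately stops at accepting it.

```go
package main

import "fmt"

// Hypothetical types for illustration only; not the YuniKorn API.

// App is an application found during state initialisation.
type App struct {
	ID            string
	Queue         string
	QueueResolved bool // false when the requested queue does not exist
}

// Registry holds the known queues and the applications added so far.
type Registry struct {
	Queues map[string]bool
	Apps   map[string]*App
}

// AddRecoveredApp always adds the application. A missing queue is
// recorded on the application instead of causing a rejection.
func (r *Registry) AddRecoveredApp(id, queue string) *App {
	app := &App{ID: id, Queue: queue, QueueResolved: r.Queues[queue]}
	r.Apps[id] = app
	return app
}

func main() {
	r := &Registry{
		Queues: map[string]bool{"root.default": true},
		Apps:   map[string]*App{},
	}
	a := r.AddRecoveredApp("app-1", "root.missing")
	fmt.Println(a.ID, a.QueueResolved, len(r.Apps))
}
```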



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
