[
https://issues.apache.org/jira/browse/YUNIKORN-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg resolved YUNIKORN-1777.
---------------------------------------------
Resolution: Fixed
Resolving again, as the last change has been committed.
> [Umbrella] State initialisation simplification (phase 1)
> --------------------------------------------------------
>
> Key: YUNIKORN-1777
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1777
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler, shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Craig Condit
> Priority: Critical
> Fix For: 1.4.0
>
>
> Startup rebuilds all state of the cluster. This is called recovery. The name
> is a bit misleading: it is not really recovery, it is loading the current
> state. State initialisation is a better term to use.
> The current recovery code links the loading of applications and tasks (pods)
> to node loading. This makes the recovery code complex and thus fragile. In
> a worst-case scenario, it could lead to a pod not being recovered correctly.
> Recovery should be a step-by-step process with clear boundaries between steps:
> * load nodes
> ** register nodes with the core
> * load pods
> ** create applications in core
> ** register running pods as allocations with the core
> ** register pending pods as asks with the core
> * process changes for nodes and pods
> * start scheduling
> No nodes, applications or asks on existing apps should be declined. Even if
> the queue does not exist, a running application must be added and handled.
> The current rejection of an application that cannot be placed in the queue
> is incorrect behaviour.
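
The ordering described in the issue can be illustrated with a minimal Go sketch. Everything in it is an assumption made for illustration: the `Pod` and `Core` types, `InitialiseState`, and the `root.fallback` queue name are hypothetical and are not taken from the YuniKorn code base. The issue only states that a running application must be added even if its queue does not exist; placing it in a fallback queue is one possible way to do that, not necessarily what the fix implements.

```go
package main

import "fmt"

// All types and names below are hypothetical stand-ins for this sketch;
// none of them come from the YuniKorn code base.

// Pod is a simplified view of a pod as seen by the shim.
type Pod struct {
	Name, AppID, Queue string
	Running            bool // true: already bound to a node; false: pending
}

// Core is an in-memory stand-in for the scheduler core.
type Core struct {
	queues      map[string]bool
	Nodes       []string
	Apps        map[string]string // application ID -> queue it was added to
	Allocations []string
	Asks        []string
	Scheduling  bool
}

// fallbackQueue is an assumed name; the issue does not specify one.
const fallbackQueue = "root.fallback"

// NewCore creates a core that knows the given queues.
func NewCore(queues ...string) *Core {
	c := &Core{queues: map[string]bool{}, Apps: map[string]string{}}
	for _, q := range queues {
		c.queues[q] = true
	}
	return c
}

// addApp adds the application of a pod once and never rejects it:
// an unknown queue is replaced by the fallback queue.
func (c *Core) addApp(p Pod) {
	if _, ok := c.Apps[p.AppID]; ok {
		return
	}
	queue := p.Queue
	if !c.queues[queue] {
		queue = fallbackQueue
	}
	c.Apps[p.AppID] = queue
}

// InitialiseState runs the steps in a fixed order, each one finishing
// before the next starts.
func InitialiseState(c *Core, nodes []string, pods []Pod, changes []func(*Core)) {
	// Step 1: load nodes and register them with the core.
	c.Nodes = append(c.Nodes, nodes...)

	// Step 2: load pods; create applications, then register running
	// pods as allocations and pending pods as asks.
	for _, p := range pods {
		c.addApp(p)
		if p.Running {
			c.Allocations = append(c.Allocations, p.Name)
		} else {
			c.Asks = append(c.Asks, p.Name)
		}
	}

	// Step 3: process changes for nodes and pods that arrived meanwhile.
	for _, change := range changes {
		change(c)
	}

	// Step 4: only now start scheduling.
	c.Scheduling = true
}

func main() {
	core := NewCore("root.default")
	pods := []Pod{
		{Name: "pod-1", AppID: "app-1", Queue: "root.default", Running: true},
		{Name: "pod-2", AppID: "app-2", Queue: "root.missing", Running: false},
	}
	InitialiseState(core, []string{"node-1"}, pods, nil)
	fmt.Println(core.Apps, core.Allocations, core.Asks, core.Scheduling)
}
```

Keeping each step in its own loop means that pod loading no longer depends on node loading, which is the decoupling the issue asks for.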
--
This message was sent by Atlassian Jira
(v8.20.10#820010)