Craig Condit created YUNIKORN-2180:
--------------------------------------

             Summary: Clean up scheduler state initialization
                 Key: YUNIKORN-2180
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2180
             Project: Apache YuniKorn
          Issue Type: Sub-task
            Reporter: Craig Condit
            Assignee: Craig Condit


Scheduler state initialization (otherwise known as recovery) is currently 
fragile and somewhat unpredictable since multiple asynchronous processes 
coordinate to perform the various init tasks.

Startup initialization should be simplified to the following steps:
 # Read all priority classes from the informer and register them with the 
scheduler cache
 # Read all nodes from the informer and register them (in a drained state) with 
the scheduler core
 # Read all pods from the informer and register applications and allocations as 
necessary, associating existing allocations with nodes from step #2
 # Enable the nodes which were originally registered in step #2
 # Register and start Kubernetes event handlers
 # Re-read priority classes from the informer and remove any that have gone 
away since step #1, ensuring we don't miss priority class deletions during init
 # Re-read nodes from the informer and remove any that have gone away since 
step #2, ensuring we don't miss node deletions during init
 # Re-read pods from the informer and remove any that have gone away since step 
#3, ensuring we don't miss pod deletions during init
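
The ordering above can be sketched in Go as a single sequential routine. This is only an illustration under stated assumptions, not YuniKorn's actual code: the Informer interface, the State struct, the fakeInformer type and the Initialize function are all hypothetical names invented for this example, and the "informer" is a plain in-memory snapshot rather than a real Kubernetes client-go informer.

```go
package main

import "fmt"

// Informer is a hypothetical stand-in for an informer cache: each method
// returns a snapshot of the objects that exist at the time of the call.
type Informer interface {
	PriorityClasses() []string
	Nodes() []string
	Pods() map[string]string // pod name -> node name ("" if not yet placed)
}

// State is a hypothetical, heavily simplified view of scheduler state.
type State struct {
	PriorityClasses map[string]bool
	NodeEnabled     map[string]bool   // node name -> schedulable?
	Pods            map[string]string // pod name -> node name
	HandlersStarted bool
}

// contains reports whether name is present in names.
func contains(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}

// Initialize performs the eight steps in order on the calling goroutine.
// startHandlers is a placeholder for registering and starting event handlers.
func Initialize(inf Informer, startHandlers func()) *State {
	s := &State{
		PriorityClasses: map[string]bool{},
		NodeEnabled:     map[string]bool{},
		Pods:            map[string]string{},
	}
	// Step 1: register priority classes.
	for _, pc := range inf.PriorityClasses() {
		s.PriorityClasses[pc] = true
	}
	// Step 2: register nodes in a drained (not schedulable) state.
	initialNodes := inf.Nodes()
	for _, n := range initialNodes {
		s.NodeEnabled[n] = false
	}
	// Step 3: register pods, keeping the node link only for known nodes.
	for pod, node := range inf.Pods() {
		if _, known := s.NodeEnabled[node]; !known {
			node = "" // pending pod, or its node was never registered
		}
		s.Pods[pod] = node
	}
	// Step 4: enable the nodes registered in step 2.
	for _, n := range initialNodes {
		s.NodeEnabled[n] = true
	}
	// Step 5: from here on, live events are delivered by the handlers.
	startHandlers()
	s.HandlersStarted = true
	// Step 6: drop priority classes deleted since step 1.
	currentPCs := inf.PriorityClasses()
	for pc := range s.PriorityClasses {
		if !contains(currentPCs, pc) {
			delete(s.PriorityClasses, pc)
		}
	}
	// Step 7: drop nodes deleted since step 2.
	currentNodes := inf.Nodes()
	for n := range s.NodeEnabled {
		if !contains(currentNodes, n) {
			delete(s.NodeEnabled, n)
		}
	}
	// Step 8: drop pods deleted since step 3.
	currentPods := inf.Pods()
	for pod := range s.Pods {
		if _, ok := currentPods[pod]; !ok {
			delete(s.Pods, pod)
		}
	}
	return s
}

// fakeInformer is a mutable in-memory Informer used only for demonstration.
type fakeInformer struct {
	priorityClasses []string
	nodes           []string
	pods            map[string]string
}

func (f *fakeInformer) PriorityClasses() []string { return f.priorityClasses }
func (f *fakeInformer) Nodes() []string           { return f.nodes }
func (f *fakeInformer) Pods() map[string]string   { return f.pods }

func main() {
	inf := &fakeInformer{
		priorityClasses: []string{"high", "low"},
		nodes:           []string{"node-1", "node-2"},
		pods:            map[string]string{"pod-a": "node-1", "pod-b": "node-2"},
	}
	s := Initialize(inf, func() {
		// Simulate deletions that happened while initialization was running.
		inf.nodes = []string{"node-1"}
		inf.pods = map[string]string{"pod-a": "node-1"}
	})
	fmt.Println(len(s.PriorityClasses), len(s.NodeEnabled), len(s.Pods))
}
```

In this sketch the second read in steps 6 to 8 is what catches node-2 and pod-b, which disappeared before the event handlers could report their deletion. Running everything on one goroutine is a simplification chosen to make the ordering explicit.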

Additionally, this process should be handled entirely by the scheduler context 
to avoid multiple competing concerns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
