[ https://issues.apache.org/jira/browse/YUNIKORN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730477#comment-17730477 ]

Wilfred Spiegelenburg commented on YUNIKORN-1793:
-------------------------------------------------

We need to keep these two things in one place. Queue information is never 
updated on the pods, and we should not do that as it does not fix the issue.

BTW: moving away from "recovery" as a name and towards "initialisation". 
Recovery is not the right term: we initialise the state of YuniKorn after a 
startup, we do not recover it, as there is no state in YuniKorn.

When we initialise we must not reject anything that was scheduled by YuniKorn 
earlier, whatever the reason. The issue needs to be solved as part of 
initialisation. We should not link this to the queue type change or require 
that feature to implement this fix.
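
A minimal sketch of the "never reject during initialisation" rule, combined 
with the temporary queue idea from the issue description. This is not the 
YuniKorn code: placeFunc, placeApplication and the queue name 
"root.@init-fallback@" are hypothetical names made up for illustration, and 
the sketch assumes that a name containing "@" cannot be created via the 
configuration or placement rules.

```go
package main

import (
	"errors"
	"fmt"
)

// initFallbackQueue is a hypothetical reserved queue name. The assumption is
// that the "@" characters make it impossible to create this queue via the
// configuration or the placement rules.
const initFallbackQueue = "root.@init-fallback@"

// errNoQueue signals that no placement rule produced a usable queue.
var errNoQueue = errors.New("no placement rule matched")

// placeFunc stands in for the placement rule evaluation: it returns the
// queue for an application, or an error when placement fails.
type placeFunc func(appID string) (string, error)

// placeApplication resolves the queue for an application. During
// initialisation a failed placement never rejects the application: it is
// put in the reserved fallback queue instead. Outside initialisation the
// placement error is returned and the application is rejected.
func placeApplication(appID string, place placeFunc, initialising bool) (string, error) {
	queue, err := place(appID)
	if err == nil {
		return queue, nil
	}
	if initialising {
		return initFallbackQueue, nil
	}
	return "", fmt.Errorf("application %s rejected: %w", appID, err)
}
```

The initialising flag keeps the fallback restricted to workloads seen during 
init: a new application submitted later with the same failing placement is 
still rejected.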

We already ignore quotas on the queues during initialisation, as those might 
have been changed in the config. The fix for this second part is also directly 
linked to initialisation, but it is a bit more complex. The prerequisite is 
uncoupling the creation of nodes, applications and requests during 
initialisation: every object needs its own handling. Nodes and allocations are 
currently handled as one, in a single message. Applications are already 
separated out.
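
To illustrate the quota part: a small sketch, under simplified assumptions, of 
a queue that enforces its quota during normal scheduling but ignores it while 
initialising. The queue type, its single resource dimension and the 
addAllocation method are hypothetical and only serve the example.

```go
package main

import "fmt"

// queue is a simplified, hypothetical stand-in for a scheduler queue that
// tracks a single resource dimension.
type queue struct {
	name string
	// max is the configured quota. It may have changed in the config since
	// the existing pods were scheduled.
	max int64
	// allocated is the usage currently tracked on the queue.
	allocated int64
}

// addAllocation records usage on the queue. During normal scheduling the
// quota is enforced. During initialisation the quota is ignored, because
// the allocation already exists on the cluster and the configured quota
// might have changed since it was made.
func (q *queue) addAllocation(size int64, initialising bool) error {
	if !initialising && q.allocated+size > q.max {
		return fmt.Errorf("queue %s: quota exceeded (%d + %d > %d)",
			q.name, q.allocated, size, q.max)
	}
	q.allocated += size
	return nil
}
```

A queue restored this way can end up above its quota. In this sketch new 
allocations are then simply refused until usage drops below the limit again.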

The current code removes the node when an allocation exists for an application 
that failed placement. This is wrong. The fix has multiple parts, starting with 
not rejecting the app on init. The rest will follow on top of that. Before any 
of that can happen we need to change the way the K8shim and core interact for 
this.
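
One possible shape for handling nodes and allocations independently, as a 
rough sketch only. The cluster and existingAllocation types and the addNode 
method are hypothetical, not the real core or K8shim interface. Returning the 
allocations of unknown applications to the caller is an assumption made for 
the example; the point shown is only that such an allocation does not remove 
the node.

```go
package main

// existingAllocation is a hypothetical record of a pod that is already
// running on a node when the scheduler starts.
type existingAllocation struct {
	appID string
	key   string
}

// cluster is a simplified, hypothetical view of the scheduler state.
type cluster struct {
	// nodes holds the registered node IDs.
	nodes map[string]bool
	// apps holds the known application IDs.
	apps map[string]bool
	// allocs maps an allocation key to the node it runs on.
	allocs map[string]string
}

// addNode registers the node first and then restores each reported
// allocation on its own. An allocation whose application is unknown is
// handed back to the caller for separate handling (an assumption of this
// sketch). It never causes the node to be removed.
func (c *cluster) addNode(nodeID string, existing []existingAllocation) []existingAllocation {
	c.nodes[nodeID] = true
	var unplaced []existingAllocation
	for _, alloc := range existing {
		if !c.apps[alloc.appID] {
			unplaced = append(unplaced, alloc)
			continue
		}
		c.allocs[alloc.key] = nodeID
	}
	return unplaced
}
```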

> Handle placement rule and queue changes during initialisation
> -------------------------------------------------------------
>
>                 Key: YUNIKORN-1793
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1793
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - common
>            Reporter: Wilfred Spiegelenburg
>            Priority: Major
>
> If the placement rules change, loading an already running workload might fail.
> A similar case exists for queues that no longer exist in the config. Even if the 
> queue exists, its type could have changed, for example from leaf to parent.
> Running workloads should never be rejected during init. If a placement fails, 
> the application should be placed in a temporary queue. This needs to be 
> restricted to workloads during init only, and a queue should be used that 
> cannot be created via the configuration or placement rules.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
