[
https://issues.apache.org/jira/browse/YUNIKORN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730477#comment-17730477
]
Wilfred Spiegelenburg commented on YUNIKORN-1793:
-------------------------------------------------
We need to keep these two things in one place. Queue information is never
updated on the pods, and we should not do that as it does not fix the issue.
BTW: I am moving away from recovery as a name and towards initialisation.
Recovery is not the right term: we initialise the state of YuniKorn after a
startup, we do not recover, as there is no state in YuniKorn.
When we initialise we must not reject anything that was scheduled by YuniKorn
earlier, whatever the reason would be. The issue needs to be solved as part of
initialisation. We should not link this to the queue type change or require
that feature to implement this fix.
We already ignore quotas on the queues during the initialisation as those might
have been changed in the config. The fix for this second part is also directly
linked to initialisation. However it is a bit more complex. The pre-requisite
is the uncoupling of the node, application and requests creation during
initialisation. Every object needs its own handling. Nodes and allocations are
handled as one, in a single message. Applications are separated out already.
The current code removes the node when an allocation exists for an application
that failed placement. This is wrong. The fix has multiple parts, starting with
not rejecting the app on init. The rest will follow on top of that. Before
any of that can happen we need to change the way the K8shim and core interact
for this.
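
To illustrate the point above about ignoring queue quotas during
initialisation, here is a minimal sketch. It is written in Python for brevity
and is not the YuniKorn implementation: the Queue class, its fields and both
method names are hypothetical, and resources are reduced to a single memory
number.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical, simplified queue with one resource dimension."""
    name: str
    max_memory: int       # configured quota, may have been lowered since the pods started
    used_memory: int = 0

    def try_allocate(self, memory: int) -> bool:
        """Normal scheduling path: enforce the configured quota."""
        if self.used_memory + memory > self.max_memory:
            return False
        self.used_memory += memory
        return True

    def restore_allocation(self, memory: int) -> None:
        """Initialisation path: the pod is already running, so the usage is
        recorded without a quota check and can never be rejected."""
        self.used_memory += memory
```

With this split a queue can end up over its quota after initialisation. New
requests on the normal path are then refused until usage drops, but nothing
that was already running is rejected.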
> Handle placement rule and queue changes during initialisation
> -------------------------------------------------------------
>
> Key: YUNIKORN-1793
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1793
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - common
> Reporter: Wilfred Spiegelenburg
> Priority: Major
>
> If the placement rules change, loading an already running workload might fail.
> A similar case exists for queues that no longer exist in the config. Even if the
> queue exists, the type could have changed from leaf to parent etc.
> Running workloads should never be rejected during init. If a placement fails
> the application should be placed in a temporary queue. This needs to be
> restricted to workloads during init only and a queue should be used that
> cannot be created via the configuration or placement rules.
>
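
The fallback described in the issue could look roughly like the sketch below.
It is written in Python for brevity and only shows the control flow. All names
are hypothetical, including the queue name "root.@init@", which is assumed to
be reserved so that neither the configuration nor a placement rule can produce
it. The place() function is a stand-in for the real placement rules.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical reserved queue name, assumed unreachable from config or rules.
INIT_FALLBACK_QUEUE = "root.@init@"


class PlacementError(Exception):
    """Raised when no placement rule yields a usable leaf queue."""


@dataclass
class Application:
    app_id: str
    requested_queue: str
    queue: Optional[str] = None


def place(app: Application, leaf_queues: Set[str]) -> str:
    """Stand-in for the placement rules: accept only existing leaf queues."""
    if app.requested_queue in leaf_queues:
        return app.requested_queue
    raise PlacementError(f"no leaf queue for {app.app_id}: {app.requested_queue}")


def add_application(app: Application, leaf_queues: Set[str],
                    initialising: bool) -> Application:
    """Place the application; fall back to the reserved queue only during init."""
    try:
        app.queue = place(app, leaf_queues)
    except PlacementError:
        if not initialising:
            # Outside initialisation a failed placement is still a rejection.
            raise
        # During initialisation a running workload is never rejected.
        app.queue = INIT_FALLBACK_QUEUE
    return app
```

The initialising flag keeps the fallback restricted to workloads loaded during
init, as the description requires. A submission that fails placement at any
other time is rejected as before.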