[
https://issues.apache.org/jira/browse/YUNIKORN-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-2790:
--------------------------------------------
Labels: release-notes (was: )
> GPU node restart could leave root queue always out of quota
> -----------------------------------------------------------
>
> Key: YUNIKORN-2790
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2790
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Critical
> Labels: release-notes
>
> On a node restart, the pods assigned to and running on the node are not
> checked against the quota of the queue(s) they run in. There are multiple
> reasons for this. Pods that were scheduled by YuniKorn and are already running
> on a node must not be rejected: rejecting them could cause many side effects.
> The combination of a node restart and the reconfiguration of a GPU driver
> can, however, cause a secondary issue. After the restart the node might not
> expose the GPU resource yet, while pods that ran before the restart can still
> be using it. After those pods are added, ignoring quotas, the root queue
> shows usage for a resource that has not been registered yet.
> This prevents all scheduling from progressing, even for pods that do not
> request the GPU resource: each scheduling action checks the root queue quota
> and fails. As a result the GPU driver pods cannot be placed, and the node
> cannot register the GPU.
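
The deadlock described above can be illustrated with a minimal Go sketch. This is not YuniKorn code: the Resources type and the fitsIn, add and canSchedule helpers are hypothetical, and the sketch assumes that a resource missing from the root limit counts as a limit of zero. Under that assumption, one restored GPU pod is enough to make every later root quota check fail, including the check for a driver pod that requests no GPU.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity.
// Hypothetical type for illustration, not the YuniKorn type.
type Resources map[string]int64

// fitsIn reports whether usage fits within limit. A resource that is
// missing from limit is treated as zero (an assumption of this sketch),
// so any positive usage of an unregistered resource fails the check.
func fitsIn(limit, usage Resources) bool {
	for name, used := range usage {
		if used > limit[name] {
			return false
		}
	}
	return true
}

// add returns the element-wise sum of a and b as a new Resources value.
func add(a, b Resources) Resources {
	out := Resources{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

// canSchedule models the root queue check: the current usage plus the
// new ask must fit within the resources registered by the nodes.
func canSchedule(rootLimit, rootUsage, ask Resources) bool {
	return fitsIn(rootLimit, add(rootUsage, ask))
}

func main() {
	// The node came back without the GPU resource registered.
	rootLimit := Resources{"cpu": 32000, "memory": 128}
	// Pods restored after the restart, added while ignoring quotas.
	rootUsage := Resources{"cpu": 4000, "memory": 16, "nvidia.com/gpu": 1}
	// The GPU driver pod itself does not request a GPU.
	driverPod := Resources{"cpu": 500, "memory": 1}

	fmt.Println(canSchedule(rootLimit, rootUsage, driverPod)) // false

	// Once the GPU is registered, the same request passes.
	rootLimit["nvidia.com/gpu"] = 2
	fmt.Println(canSchedule(rootLimit, rootUsage, driverPod)) // true
}
```

The first check fails only because of usage that was already present before the request arrived, which is why pods that do not request the GPU are blocked as well.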
--
This message was sent by Atlassian Jira
(v8.20.10#820010)