[jira] [Updated] (YUNIKORN-2790) GPU node restart could leave root queue always out of quota

Wilfred Spiegelenburg (Jira) Wed, 07 Aug 2024 01:37:38 -0700


     [ 
https://issues.apache.org/jira/browse/YUNIKORN-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wilfred Spiegelenburg updated YUNIKORN-2790:
--------------------------------------------
    Labels: release-notes  (was: )

> GPU node restart could leave root queue always out of quota
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-2790
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2790
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>              Labels: release-notes
>
> On a node restart the pods assigned and running on a node are not checked 
> against the quota of the queue(s) they run in. This has multiple reasons. 
> Pods on a node that are scheduled by YuniKorn and already running must not be 
> rejected. Rejecting pods could cause lots of side effects.
> The combination of a node restart and the reconfiguration of a GPU driver 
> could, however, cause a secondary issue. The node might not expose the GPU 
> resource yet when it restarts. Pods that ran before the restart can be using 
> the GPU resource. After those pods are added, ignoring quotas, the root queue 
> will show usage for a resource that has not been registered yet.
> This prevents all scheduling from progressing, even for pods not requesting 
> the GPU resource. Each scheduling action checks the root queue quota and 
> fails. This prevents the GPU driver pods from being placed and the GPU from 
> being registered by the node.
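
The deadlock described above can be illustrated with a minimal Go sketch. This
is not YuniKorn's actual code: the Resource type, the add, fitsIn and
canSchedule helpers, and all quantities are hypothetical and chosen only for
illustration. The sketch assumes that the root queue limit is the sum of the
registered node resources and that a resource type missing from the limit
counts as zero.

```go
package main

import "fmt"

// Resource maps a resource name to a quantity.
// Hypothetical type for this sketch, not the YuniKorn type.
type Resource map[string]int64

// add returns the per-type sum of a and b.
func add(a, b Resource) Resource {
	out := Resource{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

// fitsIn reports whether every quantity in need is within limit.
// Assumption: a type that is missing from limit counts as zero.
func fitsIn(limit, need Resource) bool {
	for name, qty := range need {
		if qty > limit[name] {
			return false
		}
	}
	return true
}

// canSchedule mimics a root queue check: current usage plus the new
// request must fit in the root queue limit.
func canSchedule(rootMax, rootUsed, request Resource) bool {
	return fitsIn(rootMax, add(rootUsed, request))
}

func main() {
	// After the restart the node has not registered its GPU yet.
	rootMax := Resource{"cpu": 8000, "memory": 32000}
	// Pods restored from before the restart were added without a quota
	// check, so the usage contains a GPU that the limit does not have.
	rootUsed := Resource{"cpu": 2000, "memory": 4000, "nvidia.com/gpu": 1}
	// The GPU driver pod itself requests no GPU.
	driverPod := Resource{"cpu": 100, "memory": 200}

	fmt.Println("before GPU registration:", canSchedule(rootMax, rootUsed, driverPod))

	// Once the GPU is registered the same request passes.
	rootMax["nvidia.com/gpu"] = 1
	fmt.Println("after GPU registration:", canSchedule(rootMax, rootUsed, driverPod))
}
```

In this sketch the driver pod is rejected only because the existing usage
already exceeds the limit for the unregistered GPU type, which matches the
circular dependency in the description: the GPU cannot be registered until the
driver pod is placed, and the driver pod cannot be placed until the GPU is
registered.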



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
