[jira] [Commented] (YUNIKORN-2790) GPU node restart could leave root queue always out of quota

Wilfred Spiegelenburg (Jira) Wed, 07 Aug 2024 00:26:05 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871556#comment-17871556
 ]


Wilfred Spiegelenburg commented on YUNIKORN-2790:
-------------------------------------------------

The solution is to skip resource types that a pod does not request when we 
check whether the pod fits in the queue. A pod asking for memory and vcores 
can then be scheduled even if the root queue is out of GPU or storage. No 
other queue in the hierarchy should end up over quota like this, unless its 
quota has been lowered below the usage of the currently running workload.

This makes scheduling more resilient to configuration changes and to delays 
in custom resource registration.
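
The difference between the two checks can be sketched in Go as below. This is 
only an illustration, not the YuniKorn implementation: the Resource type and 
the functions fitsAll and fitsRequested are hypothetical names, and a resource 
type missing from the quota is treated as zero, which models a root queue 
whose node has not registered the GPU yet.

```go
package main

import "fmt"

// Resource maps a resource type name to a quantity.
// Hypothetical type for illustration, not the YuniKorn core type.
type Resource map[string]int64

// fitsAll models a strict check: every resource type that is in use or
// requested must stay within the quota. A type missing from max counts
// as zero (assumption: models an unregistered resource on the root queue).
func fitsAll(max, used, req Resource) bool {
	for name, u := range used {
		if u+req[name] > max[name] {
			return false
		}
	}
	for name, r := range req {
		if used[name]+r > max[name] {
			return false
		}
	}
	return true
}

// fitsRequested models the relaxed check: only resource types that the
// request asks for are compared against the quota.
func fitsRequested(max, used, req Resource) bool {
	for name, r := range req {
		if r <= 0 {
			continue // nothing requested for this type
		}
		if used[name]+r > max[name] {
			return false
		}
	}
	return true
}

func main() {
	// Root queue after a node restart: the GPU is not registered yet,
	// but pods that ran before the restart still account for GPU usage.
	max := Resource{"memory": 64000, "vcore": 16000}
	used := Resource{"memory": 8000, "vcore": 2000, "nvidia.com/gpu": 2}
	driverPod := Resource{"memory": 1000, "vcore": 500}

	fmt.Println(fitsAll(max, used, driverPod))       // false
	fmt.Println(fitsRequested(max, used, driverPod)) // true
}
```

With the relaxed check a pod that requests the GPU is still rejected until the 
resource is registered, while the pod that only asks for memory and vcores is 
no longer blocked.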

> GPU node restart could leave root queue always out of quota
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-2790
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2790
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>
> On a node restart, the pods assigned to and running on the node are not 
> checked against the quota of the queue(s) they run in. There are multiple 
> reasons for this. Pods on a node that were scheduled by YuniKorn and are 
> already running must not be rejected. Rejecting pods could cause many side 
> effects.
> The combination of a node restart and the reconfiguration of a GPU driver 
> could, however, cause a secondary issue. On restart, the node might not 
> expose the GPU resource yet. Pods that ran before the restart can be using 
> the GPU resource. After those pods are added, ignoring quotas, the root 
> queue will show usage for a resource that has not been registered yet.
> This prevents all scheduling from progressing, even for pods that do not 
> request the GPU resource. Each scheduling action will check the root queue 
> quota and fail. This prevents the GPU driver pods from being placed and the 
> GPU from being registered by the node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
