[jira] [Created] (YUNIKORN-2790) GPU node restart could leave root queue always out of quota

Wilfred Spiegelenburg (Jira) Tue, 06 Aug 2024 23:02:03 -0700

Wilfred Spiegelenburg created YUNIKORN-2790:
-----------------------------------------------


             Summary: GPU node restart could leave root queue always out of quota
                 Key: YUNIKORN-2790
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2790
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Wilfred Spiegelenburg
            Assignee: Wilfred Spiegelenburg


On a node restart, the pods already assigned to and running on that node are 
not checked against the quota of the queue(s) they run in. There are multiple 
reasons for this. Pods that were scheduled by YuniKorn and are already running 
on a node must not be rejected. Rejecting them could cause many side effects.

The combination of a node restart and the reconfiguration of a GPU driver 
could, however, cause a secondary issue. On restart, the node might not expose 
the GPU resource yet, while pods that ran before the restart may be using the 
GPU resource. After those pods are added, ignoring quotas, the root queue will 
show usage for a resource that has not been registered yet.

This prevents all scheduling from progressing, even for pods that do not 
request the GPU resource: each scheduling action checks the root queue quota 
and fails. As a result, the GPU driver pods cannot be placed and the GPU 
cannot be registered by the node.
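
The deadlock can be illustrated with a small standalone sketch. This is not
YuniKorn code: the Resources type and the add, fitsIn and canSchedule helpers
are hypothetical names made up for the illustration. It assumes that the root
queue limit is whatever the registered nodes currently report, and that a
resource missing from that limit is treated as a limit of zero.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity (hypothetical type for this sketch).
type Resources map[string]int64

// add returns the element-wise sum of a and b.
func add(a, b Resources) Resources {
	out := Resources{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

// fitsIn reports whether every resource in used stays within limit.
// Assumption: a resource that is absent from limit has a limit of zero.
func fitsIn(limit, used Resources) bool {
	for name, qty := range used {
		if qty > limit[name] {
			return false
		}
	}
	return true
}

// canSchedule models the root queue check: current usage plus the new
// request must fit within the root queue limit.
func canSchedule(rootMax, rootUsed, ask Resources) bool {
	return fitsIn(rootMax, add(rootUsed, ask))
}

func main() {
	// After the restart the node registers without its GPU resource.
	rootMax := Resources{"vcore": 16, "memory": 64}
	// A pod that ran before the restart is restored, quota ignored.
	rootUsed := Resources{"vcore": 4, "memory": 8, "nvidia.com/gpu": 1}
	// A new request that does not ask for a GPU at all.
	cpuOnly := Resources{"vcore": 1, "memory": 1}

	// The GPU usage already exceeds the (zero) GPU limit, so this fails.
	fmt.Println(canSchedule(rootMax, rootUsed, cpuOnly))

	// Once the node reports the GPU, the same request passes.
	rootMax["nvidia.com/gpu"] = 1
	fmt.Println(canSchedule(rootMax, rootUsed, cpuOnly))
}
```

In this model the first check fails although the request contains no GPU,
because the restored usage alone already violates the limit. That matches the
behaviour described above: the driver pods that would make the node register
the GPU are blocked by the same check.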



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

