Wilfred Spiegelenburg created YUNIKORN-2790:
-----------------------------------------------
Summary: GPU node restart could leave root queue always out of quota
Key: YUNIKORN-2790
URL: https://issues.apache.org/jira/browse/YUNIKORN-2790
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
On a node restart, the pods assigned to and running on the node are not checked
against the quota of the queue(s) they run in. There are multiple reasons for
this. Pods that were scheduled by YuniKorn and are already running on a node
must not be rejected, because rejecting them could cause many side effects.
However, the combination of a node restart and the reconfiguration of a GPU
driver can cause a secondary issue. On restart, the node might not expose the
GPU resource yet, while pods that ran before the restart can still be using it.
After those pods are added back, ignoring quotas, the root queue shows usage
for a resource that has not been registered yet.
This prevents all scheduling from progressing, even for pods that do not
request the GPU resource: each scheduling action checks the root queue quota
and fails. As a result, the GPU driver pods cannot be placed and the node
cannot register the GPU.
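
The deadlock described above can be illustrated with a small, self-contained
model. This is only a sketch and not YuniKorn's actual code: the Resources type,
the fitsIn function, and the quantities are hypothetical, and
"nvidia.com/gpu" is used merely as an example resource name. The sketch assumes
that the root queue maximum is the sum of the registered node resources and that
a resource missing from the maximum counts as zero.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity.
// Hypothetical simplified type, not the YuniKorn resource type.
type Resources map[string]int64

// fitsIn reports whether used plus ask stays within max for every
// resource that appears in used or ask. A resource that is missing
// from max is treated as a limit of zero (assumption of this sketch).
func fitsIn(max, used, ask Resources) bool {
	total := Resources{}
	for name, qty := range used {
		total[name] += qty
	}
	for name, qty := range ask {
		total[name] += qty
	}
	for name, qty := range total {
		if qty > max[name] {
			return false
		}
	}
	return true
}

func main() {
	// The node re-registered after the restart without its GPU.
	rootMax := Resources{"vcore": 8000, "memory": 32768}
	// Pods from before the restart were re-added without a quota check.
	rootUsed := Resources{"vcore": 2000, "memory": 8192, "nvidia.com/gpu": 1}
	// The GPU driver pod itself does not request a GPU.
	driverAsk := Resources{"vcore": 100, "memory": 256}

	// The GPU usage already exceeds the (missing) GPU maximum.
	fmt.Println(fitsIn(rootMax, rootUsed, driverAsk)) // prints "false"

	// Once the GPU is registered the same request fits.
	rootMax["nvidia.com/gpu"] = 1
	fmt.Println(fitsIn(rootMax, rootUsed, driverAsk)) // prints "true"
}
```

In this model the driver pod is rejected only because of the pre-existing GPU
usage, which matches the behaviour reported here: the request that would fix
the missing resource is blocked by the missing resource.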