Weiwei Yang created YUNIKORN-741:
------------------------------------
Summary: Regression: occupied resources miscalculated sometimes
for yunikorn pods
Key: YUNIKORN-741
URL: https://issues.apache.org/jira/browse/YUNIKORN-741
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Weiwei Yang
This is a regression caused by YUNIKORN-677.
YUNIKORN-677 changed how we decide whether a pod needs recovery: the check is now
based on whether the pod is assigned to a node (pod.Spec.NodeName is set).
Occupied resources are handled in a similar way. However, the fix in YUNIKORN-677
changed the condition for occupied resource recovery but left the node coordinator
code (where we handle pod updates) using the old check. This causes the following
issue:
* During recovery, the scheduler sees that the scheduler pod is already assigned
to a node (pod.Spec.NodeName is set), so its occupied resources are reported to
the core. Code:
[https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128].
* Once the scheduler has recovered, the pod informers are started and the
node coordinator starts to run. In some cases, the node informer notifies us
of a phase change (from Pending to Running) for the scheduler pod and the
admission-controller pod, and this triggers a second occupied resource update
for the same pods. Code:
[https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101]