Weiwei Yang created YUNIKORN-741:
------------------------------------

             Summary: Regression: occupied resources miscalculated sometimes 
for yunikorn pods
                 Key: YUNIKORN-741
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-741
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Weiwei Yang


This is a regression caused by YUNIKORN-677. 

YUNIKORN-677 changed how we decide whether a pod needs recovery: the check is 
now based on whether the pod is allocated to a node (pod.Spec.NodeName is set). 
Occupied resources are handled in a similar way. However, the fix in 
YUNIKORN-677 changed the condition for occupied resource recovery but left the 
node coordinator code (where we handle pod updates) using the old condition. 
This causes the following issue:
 * During recovery, the scheduler sees that the scheduler pod is already 
allocated (pod.Spec.NodeName is set), so its occupied resources are reported 
to the core. Code: 
[https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128].
 * Once the scheduler has recovered, the pod informers are started and the 
node coordinator starts to run. In some cases, the node informer then notifies 
us that the scheduler pod and the admission-controller pod changed phase (from 
Pending to Running), and this triggers a second occupied resource update for 
pods whose resources were already reported during recovery. Code: 
[https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101]
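
The sequence above can be illustrated with a minimal, self-contained sketch. 
It is not the k8shim code: the Pod and Node types, the function names, and the 
500m CPU figure are all hypothetical, and the phase-based check is only an 
assumed model of the "old way" described above. It shows how a recovery path 
keyed on NodeName and an update path keyed on the Pending to Running 
transition can both report the same pod, and how using one condition in both 
places (one possible way to align them, not necessarily the fix that was 
adopted) avoids the second report.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name     string
	NodeName string // mirrors pod.Spec.NodeName
	Phase    string // mirrors pod.Status.Phase, e.g. "Pending" or "Running"
	MilliCPU int64  // requested CPU, in millicores
}

// Node is a hypothetical holder for the occupied resources reported to the core.
type Node struct {
	OccupiedMilliCPU int64
}

// isAssigned is the recovery-time condition: the pod is bound to a node.
func isAssigned(p Pod) bool { return p.NodeName != "" }

// legacyOnUpdate models the assumed old update-path condition: a pod is
// reported as occupied when its phase moves from Pending to Running.
func legacyOnUpdate(n *Node, oldPod, newPod Pod) {
	if oldPod.Phase == "Pending" && newPod.Phase == "Running" {
		n.OccupiedMilliCPU += newPod.MilliCPU
	}
}

// consistentOnUpdate uses the same assignment-based condition as recovery:
// a pod is reported only when it goes from unassigned to assigned.
func consistentOnUpdate(n *Node, oldPod, newPod Pod) {
	if !isAssigned(oldPod) && isAssigned(newPod) {
		n.OccupiedMilliCPU += newPod.MilliCPU
	}
}

// simulate replays the two steps from the issue: recovery of an already
// assigned but still Pending pod, then a Pending -> Running update event.
func simulate(onUpdate func(*Node, Pod, Pod)) int64 {
	node := &Node{}
	pending := Pod{Name: "yunikorn-scheduler", NodeName: "node-1", Phase: "Pending", MilliCPU: 500}

	// Step 1: recovery reports every pod that is already assigned to a node.
	if isAssigned(pending) {
		node.OccupiedMilliCPU += pending.MilliCPU
	}

	// Step 2: after recovery, an informer delivers the phase change.
	running := pending
	running.Phase = "Running"
	onUpdate(node, pending, running)

	return node.OccupiedMilliCPU
}

func main() {
	fmt.Println("legacy update check:    ", simulate(legacyOnUpdate))
	fmt.Println("consistent update check:", simulate(consistentOnUpdate))
}
```

With the legacy check the pod's 500m is reported twice, so the node ends up 
with 1000m occupied; with the consistent check it stays at 500m.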



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
