yangwwei opened a new pull request #279:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/279


   ### What is this PR for?
   This PR fixes a regression caused by https://issues.apache.org/jira/browse/YUNIKORN-677.
   
   
   ### What type of PR is it?
   * [x] - Bug Fix
   * [ ] - Improvement
   * [ ] - Feature
   * [ ] - Documentation
   * [ ] - Hot Fix
   * [ ] - Refactoring
   
   ### Todos
   * [ ] - Task
   
   ### What is the Jira issue?
   * https://issues.apache.org/jira/browse/YUNIKORN-741
   
   ### What is the root cause?
   
   This is a regression caused by https://issues.apache.org/jira/browse/YUNIKORN-677. YUNIKORN-677 changed how we determine whether a pod needs recovery: the check is now based on whether the pod is allocated to a node (i.e. `pod.Spec.NodeName` is set). Occupied resources are handled similarly; however, the fix in YUNIKORN-677 changed the condition for occupied resource recovery but left the node coordinator code (where we handle pod updates) the old way. This causes the following issue:
   
   - During recovery, the scheduler sees that the scheduler pod is already allocated (`pod.Spec.NodeName` is set), so its occupied resources are reported to the core, code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128.
   - Once the scheduler has recovered, the pod informers are started and the node coordinator begins to run. In some cases, the pod informer notifies us of phase changes for the scheduler pod and the admission-controller pod (from Pending to Running), and this triggers another occupied resource update. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101
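
   The recovery-time condition described above can be sketched as follows. This is a minimal illustration, not the actual shim code: the `Pod` struct is a local stand-in for the relevant `k8s.io/api/core/v1` fields, and `podNeedsRecovery` is a hypothetical name.

   ```go
   package main

   import "fmt"

   // Pod is a minimal stand-in for the Kubernetes Pod fields involved;
   // the real shim uses k8s.io/api/core/v1 types.
   type Pod struct {
   	NodeName string // pod.Spec.NodeName
   	Phase    string // pod.Status.Phase
   }

   // podNeedsRecovery mirrors the YUNIKORN-677 check described above:
   // a pod's resources (including occupied resources) are recovered when
   // it is bound to a node and not in a terminated phase.
   func podNeedsRecovery(p Pod) bool {
   	terminated := p.Phase == "Succeeded" || p.Phase == "Failed"
   	return p.NodeName != "" && !terminated
   }

   func main() {
   	fmt.Println(podNeedsRecovery(Pod{NodeName: "node-1", Phase: "Running"})) // true
   	fmt.Println(podNeedsRecovery(Pod{NodeName: "", Phase: "Pending"}))       // false: not yet assigned
   	fmt.Println(podNeedsRecovery(Pod{NodeName: "node-1", Phase: "Failed"}))  // false: terminated
   }
   ```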
   
   ### What is the fix?
   
   The recovery code is doing the correct thing: as long as a pod is assigned to a node and not in a terminated state, the pod's resources should be recovered, including occupied resources. The problem is post-recovery: while the scheduler is running, we need to ensure that pod updates do not trigger unnecessary occupied resource reports. To achieve that, we should enforce the following rules in the node coordinator:
   
   - Adding occupied resources should only be triggered when:
     - the pod is newly assigned to a node
     - the pod is not in a terminated state
   - Subtracting occupied resources should only be triggered when:
     - the pod is already assigned to a node
     - the pod phase changes from non-terminated (Pending, Running) to terminated (Succeeded, Failed)
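
   The two rules above can be sketched as predicates over the old and new pod states seen by the node coordinator's update handler. This is an illustrative sketch, not the actual fix: the `Pod` struct is a local stand-in for `k8s.io/api/core/v1` fields, and the function names are hypothetical.

   ```go
   package main

   import "fmt"

   // Pod is a minimal stand-in for the Kubernetes Pod fields the node
   // coordinator inspects; the real shim uses k8s.io/api/core/v1 types.
   type Pod struct {
   	NodeName string // pod.Spec.NodeName
   	Phase    string // pod.Status.Phase
   }

   func terminated(p Pod) bool {
   	return p.Phase == "Succeeded" || p.Phase == "Failed"
   }

   // shouldAddOccupied: add occupied resources only when the pod is newly
   // assigned to a node and is not terminated.
   func shouldAddOccupied(oldPod, newPod Pod) bool {
   	return oldPod.NodeName == "" && newPod.NodeName != "" && !terminated(newPod)
   }

   // shouldSubtractOccupied: subtract occupied resources only when an
   // already-assigned pod transitions from non-terminated to terminated.
   func shouldSubtractOccupied(oldPod, newPod Pod) bool {
   	return oldPod.NodeName != "" && !terminated(oldPod) && terminated(newPod)
   }

   func main() {
   	// Pending -> Running on the same node: neither rule fires, so the
   	// post-recovery phase change no longer re-reports occupied resources.
   	oldPod := Pod{NodeName: "node-1", Phase: "Pending"}
   	newPod := Pod{NodeName: "node-1", Phase: "Running"}
   	fmt.Println(shouldAddOccupied(oldPod, newPod))      // false
   	fmt.Println(shouldSubtractOccupied(oldPod, newPod)) // false
   }
   ```

   Under these rules, the Pending-to-Running transition that the informer reports after recovery matches neither predicate, so no duplicate occupied resource update is sent.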
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
