yangwwei opened a new pull request #279: URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/279
### What is this PR for? This PR is to fix a regression caused by https://issues.apache.org/jira/browse/YUNIKORN-677. ### What type of PR is it? * [x] - Bug Fix * [ ] - Improvement * [ ] - Feature * [ ] - Documentation * [ ] - Hot Fix * [ ] - Refactoring ### Todos * [ ] - Task ### What is the Jira issue? * https://issues.apache.org/jira/browse/YUNIKORN-741 ### What is the root cause? This is a regression caused by https://issues.apache.org/jira/browse/YUNIKORN-677. YUNIKORN-677 changes the check of how we see a pod needs recovery, now it is based on whether a pod is allocated to a node (when pod.Spec.NodeName is set). For occupied resources, it is similar, however, the fix in YUNIKORN-677 changes the condition for occupied resource recovery but leaves the node coordinator code (where we handle pod updates) as the old way. This caused the following issue: - During recovery, the scheduler sees the scheduler pod was already allocated (pod.Spec.NodeName is set), so the occupied resources was reported to the core, code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128. - Once the scheduler is recovered, the pod informers will be started, and the node coordinator starts to run. In some cases, the node informer will inform us the scheduler pod and the admission-controller pod phase changes (from Pending to Running), and this triggers another occupied resource update. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101 ### What is the fix? The recovery code is doing the correct thing, as long as the pod is assigned to a node and not in the terminated state, the pod resource should be recovered, including the occupied resources. The problem is post-recovery, when the scheduler is running, we need to ensure the pod update won’t trigger unnecessary occupied resource reports. To achieve that, we should enforce the following rules in the node coordinator: - For adding occupied resources, it should be only triggered when -- Pod is newly assigned to a node -- Pod is not in terminated state - For subtracting occupied resources, it should be only triggered when - Pod is already assigned to a node - Pod phase changes from non-terminated (Pending, Running) to terminated (Succeed, Failed) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
