[
https://issues.apache.org/jira/browse/YUNIKORN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg resolved YUNIKORN-2025.
---------------------------------------------
Fix Version/s: 1.4.0
Resolution: Fixed
Release confirmations triggered by preempted containers are no longer passed
back to the shim. This fixes the double counting of releases.
Thank you for the fix, [~Yu-Lin Chen].
> Mismatched running container count if preemption was triggered
> --------------------------------------------------------------
>
> Key: YUNIKORN-2025
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2025
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler, metrics
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.4.0
>
> Attachments: Nagative Running Container Count - UI Screenshot.png,
> Release Allocation Flow (Preemption).png,
> YUNIKORN-2025-Mismatched-running-container-count.patch
>
>
> The YuniKorn UI shows a negative running container count after the e2e test
> 'Verify_basic_preemption' completes. The e2e test creates the following 4 pods:
> * 3 pods in root.sandbox1 (1 pod will be preempted)
> * 1 pod in root.sandbox2
> The final running container count will be -1, and the internal metrics are:
> * totalContainersRunning(-1) := allocatedContainers(4) -
> releasedContainers({color:#ff0000}5{color})
> → Mismatched: releasedContainers should be 4 instead of 5.
> {+}Steps to reproduce{+}:
> 1. Trigger the e2e test 'Verify_basic_preemption'
>
> {code:java}
> cd yunikorn-k8shim/test/e2e/preemption
> ginkgo run -r -v --focus "Verify_basic_preemption" -- -yk-namespace
> "yunikorn" -kube-config "$HOME/.kube/config"{code}
> 2. Check the YuniKorn UI; the running container count will be -1.
>
> {+}Root cause and proposed solution{+}:
> Tracing the preemption flow shows that, for a preempted pod, two paths
> each increase the released container count:
> * Core → Shim (TerminationType: PREEMPTED_BY_SCHEDULER)
> * Core → Shim (TerminationType: STOPPED_BY_RM) (a callback following Shim → Core)
> We can skip the increase in the first path (Core → Shim, TerminationType:
> PREEMPTED_BY_SCHEDULER).
> (Please refer to "Release Allocation Flow (Preemption).png" and the attached
> patch file.)