Yu-Lin Chen created YUNIKORN-2025:
-------------------------------------
Summary: Mismatched running container count if preemption was
triggered
Key: YUNIKORN-2025
URL: https://issues.apache.org/jira/browse/YUNIKORN-2025
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler, metrics
Reporter: Yu-Lin Chen
Attachments: Nagative Running Container Count - UI Screenshot.png,
Release Allocation Flow (Preemption).png,
YUNIKORN-2025-Mismatched-running-container-count.patch
The YuniKorn UI shows a negative running container count after the e2e test
'Verify_basic_preemption' completes. The e2e test creates the following 4 pods:
* 3 pods in root.sandbox1 (1 pod will be preempted)
* 1 pod in root.sandbox2
The final running container count will be -1, and the internal metrics are:
* totalContainersRunning(-1) := allocatedContainers(4) -
releasedContainers({color:#FF0000}5{color})
→ Mismatched: releasedContainers should be 4 instead of 5.
{+}Steps to Reproduce{+}:
1. Trigger the e2e test 'Verify_basic_preemption'
{code:java}
cd yunikorn-k8shim/test/e2e/preemption
ginkgo run -r -v --focus "Verify_basic_preemption" -- -yk-namespace "yunikorn" \
  -kube-config "$HOME/.kube/config"
{code}
2. Check the YuniKorn UI; the running container count will be -1.
{+}Root cause and proposed solution{+}:
Tracing the preemption flow shows two paths that each increment the released
container count for the preempted pod:
* Core → Shim (TerminationType: PREEMPTED_BY_SCHEDULER)
* Core → Shim (TerminationType: STOPPED_BY_RM) (A callback of Shim → Core.)
We can skip the increment in the first path (Core → Shim, TerminationType:
PREEMPTED_BY_SCHEDULER), so that the preempted pod is counted as released only
once.
(Please refer to "Release Allocation Flow (Preemption).png" and the attached
patch file.)
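The double count can be illustrated with a small standalone Go sketch. It is not the attached patch and does not use YuniKorn code: containerMetrics, release, simulate and the skipPreemptedPath flag are hypothetical names made up for illustration. It also assumes that the three non-preempted pods are each released once through STOPPED_BY_RM when the test cleans up, which matches the reported totals of 4 allocated and 5 released.

```go
package main

import "fmt"

// TerminationType is a local stand-in for the two termination types
// named in this issue; it is not the scheduler-interface type.
type TerminationType int

const (
	PreemptedByScheduler TerminationType = iota
	StoppedByRM
)

// containerMetrics is a hypothetical model of the two counters
// behind totalContainersRunning.
type containerMetrics struct {
	allocated         int
	released          int
	skipPreemptedPath bool // true models the proposed fix
}

func (m *containerMetrics) allocate() { m.allocated++ }

// release counts one released container, unless the proposed fix is
// enabled and the release arrives on the PREEMPTED_BY_SCHEDULER path.
func (m *containerMetrics) release(t TerminationType) {
	if m.skipPreemptedPath && t == PreemptedByScheduler {
		return
	}
	m.released++
}

func (m *containerMetrics) running() int { return m.allocated - m.released }

// simulate replays the scenario from this issue: 4 pods are allocated,
// 1 is preempted (two release events), and the other 3 are assumed to be
// released once each at cleanup.
func simulate(skipPreemptedPath bool) *containerMetrics {
	m := &containerMetrics{skipPreemptedPath: skipPreemptedPath}
	for i := 0; i < 4; i++ {
		m.allocate()
	}
	m.release(PreemptedByScheduler) // Core → Shim
	m.release(StoppedByRM)          // callback of Shim → Core
	for i := 0; i < 3; i++ {
		m.release(StoppedByRM)
	}
	return m
}

func main() {
	current := simulate(false)
	proposed := simulate(true)
	fmt.Printf("current:  allocated=%d released=%d running=%d\n",
		current.allocated, current.released, current.running())
	fmt.Printf("proposed: allocated=%d released=%d running=%d\n",
		proposed.allocated, proposed.released, proposed.running())
}
```

With the skip disabled the sketch reproduces the reported values (released 5, running -1); with the skip enabled the released count is 4 and the running count returns to 0.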