[ https://issues.apache.org/jira/browse/YUNIKORN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chia-Ping Tsai resolved YUNIKORN-2313.
--------------------------------------
Fix Version/s: 1.5.0
Resolution: Fixed
> Flaky E2E Test: "Verify_basic_preemption" experiences flakiness due to race
> condition
> --------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2313
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2313
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: test - e2e
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.5.0
>
> Attachments: 2_e2e-tests (v1.28.0) (Verify_basic_preemption).txt
>
>
> The failed e2e test "Verify_basic_preemption" requested more resources than
> were actually available on the K8S node. The root cause is that some pods
> running on the node are invisible to the YuniKorn scheduler, so the e2e test
> mistakenly assumed it could acquire all of the available resource recorded by
> YuniKorn.
> [https://github.com/apache/yunikorn-k8shim/actions/runs/7420644533/job/20212796775?pr=759#step:6:1483]
>
> *Logs:* (Verify_basic_preemption)
> *K8S describe node result:* (node: yk8s-worker)
> allocatable memory: 14610305024 B
> non-terminated pods' memory requests:
> * sleepjob1 - 4870M (33%)
> * sleepjob2 - 4870M (33%)
> * kindnet-tr9jc - 50Mi (0%)
> * kube-proxy-k58rn - 0 B
> * yunikorn-admission-controller-56c8c8b766-55kz8 - 500Mi (3%)
> → Total: ~70% (see the client-go sketch below for the same computation)
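> These per-pod numbers are what kubectl describe node reports: the sum of
> memory requests of all pods on the node that are not Succeeded/Failed. A
> minimal client-go sketch of the same computation, useful for cross-checking
> against YuniKorn (hypothetical helper, not part of the e2e suite; node name
> hard-coded for illustration):
> {code:go}
> package main
>
> import (
> 	"context"
> 	"fmt"
>
> 	v1 "k8s.io/api/core/v1"
> 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> 	"k8s.io/client-go/kubernetes"
> 	"k8s.io/client-go/tools/clientcmd"
> )
>
> func main() {
> 	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
> 	if err != nil {
> 		panic(err)
> 	}
> 	client := kubernetes.NewForConfigOrDie(config)
>
> 	// Non-terminated pods bound to the node under test.
> 	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
> 		FieldSelector: "spec.nodeName=yk8s-worker,status.phase!=Succeeded,status.phase!=Failed",
> 	})
> 	if err != nil {
> 		panic(err)
> 	}
>
> 	// Approximation of "kubectl describe node": sum container memory
> 	// requests (ignores init containers and pod overhead).
> 	var requested int64
> 	for _, pod := range pods.Items {
> 		for _, c := range pod.Spec.Containers {
> 			if mem, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
> 				requested += mem.Value()
> 			}
> 		}
> 	}
> 	fmt.Printf("memory requested by non-terminated pods: %d B\n", requested)
> }
> {code}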
> *YuniKorn Node REST API response:* (node: yk8s-worker)
> * memory capacity: 14610305024 B
> * memory occupied: *Empty*
> * memory allocated: 9740000000 B (4870M + 4870M) (66%)
> * memory available: 4870305024 B (33%)
> In reality the node only has ~30% of its memory free (14610305024 −
> 9740000000 − 52428800 − 524288000 ≈ 4.29 GB), less than the 4870M (33%) that
> sleepjob3 requests, so sleepjob3 never entered the Running state.
> YuniKorn is not aware of 'kindnet-tr9jc' and
> 'yunikorn-admission-controller-56c8c8b766-55kz8' running on the node, so the
> "YuniKorn node available resource" is not equal to the "K8S node available
> resource".
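> The YuniKorn side of that comparison comes from the scheduler REST API
> (GET /ws/v1/partition/{partitionName}/nodes). A minimal sketch that prints
> the fields quoted above, assuming the default REST port 9080 and the
> NodeDAOInfo JSON field names; both are assumptions to verify against the
> deployed version:
> {code:go}
> package main
>
> import (
> 	"encoding/json"
> 	"fmt"
> 	"net/http"
> )
>
> // Subset of YuniKorn's per-node DAO; values map resource name -> quantity.
> type nodeInfo struct {
> 	NodeID    string           `json:"nodeID"`
> 	Capacity  map[string]int64 `json:"capacity"`
> 	Allocated map[string]int64 `json:"allocated"`
> 	Occupied  map[string]int64 `json:"occupied"`
> 	Available map[string]int64 `json:"available"`
> }
>
> func main() {
> 	resp, err := http.Get("http://localhost:9080/ws/v1/partition/default/nodes")
> 	if err != nil {
> 		panic(err)
> 	}
> 	defer resp.Body.Close()
>
> 	var nodes []nodeInfo
> 	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
> 		panic(err)
> 	}
> 	for _, n := range nodes {
> 		if n.NodeID != "yk8s-worker" {
> 			continue
> 		}
> 		// An empty "occupied" map is the symptom described above: memory of
> 		// pods not scheduled by YuniKorn should be accounted for here.
> 		fmt.Printf("capacity=%v occupied=%v allocated=%v available=%v\n",
> 			n.Capacity, n.Occupied, n.Allocated, n.Available)
> 	}
> }
> {code}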
> Removing the dynamic limit for the sleep jobs and setting a fixed pod/queue
> limit should fix the e2e test, but we still need to investigate why those
> running pods are invisible to YuniKorn.
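> A sketch of the fixed-limit direction (the 50M value and the identical
> requests/limits are illustrative assumptions, not the values from the actual
> fix):
> {code:go}
> package main
>
> import (
> 	"fmt"
>
> 	v1 "k8s.io/api/core/v1"
> 	"k8s.io/apimachinery/pkg/api/resource"
> )
>
> // fixedSleepPodResources returns a node-independent memory request for the
> // sleep pods, so the test no longer sizes its pods from YuniKorn's
> // (possibly incomplete) view of the node's available resource.
> func fixedSleepPodResources() v1.ResourceRequirements {
> 	fixed := resource.MustParse("50M") // illustrative fixed request
> 	return v1.ResourceRequirements{
> 		Requests: v1.ResourceList{v1.ResourceMemory: fixed},
> 		Limits:   v1.ResourceList{v1.ResourceMemory: fixed},
> 	}
> }
>
> func main() {
> 	req := fixedSleepPodResources()
> 	fmt.Println(req.Requests.Memory()) // prints "50M"
> }
> {code}
> Pairing fixed pod sizes with a fixed max resource on the test queue (in the
> yunikorn-configs ConfigMap) makes the preemption scenario independent of
> whatever else happens to run on the node.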
>