[ https://issues.apache.org/jira/browse/YUNIKORN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chia-Ping Tsai resolved YUNIKORN-2313.
--------------------------------------
    Fix Version/s: 1.5.0
       Resolution: Fixed

> Flaky E2E Test: "Verify_basic_preemption" experiences flakiness due to a race
> condition
> --------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2313
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2313
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>         Attachments: 2_e2e-tests (v1.28.0) (Verify_basic_preemption).txt
>
>
> The failed e2e test "Verify_basic_preemption" attempted to request more 
> resources than were actually available on the K8S node. The root cause is 
> that some pods running on the node are invisible to the YuniKorn scheduler, 
> so the e2e test mistakenly assumed it could acquire all of the available 
> resources reported by YuniKorn.
> [https://github.com/apache/yunikorn-k8shim/actions/runs/7420644533/job/20212796775?pr=759#step:6:1483]
>  
> *Logs:* (Verify_basic_preemption)
> *K8S describe node result:* (node: yk8s-worker)
> allocatable memory: 14610305024 B
> non-terminated pods' memory requests:
>  * sleepjob1 - 4870M (33%)
>  * sleepjob2 - 4870M (33%)
>  * kindnet-tr9jc - 50Mi (0%)
>  * kube-proxy-k58rn - 0 B
>  * yunikorn-admission-controller-56c8c8b766-55kz8 - 500Mi (3%)
> → Total: 70% (the client-go sketch below reproduces this view)
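> A minimal client-go sketch (not from the test; the node name is taken from 
> the logs above and the kubeconfig handling is an assumption) that rebuilds 
> the "Non-terminated Pods" view the way kubectl describe node does, by 
> summing container memory requests:
> {code:go}
> package main
> 
> import (
> 	"context"
> 	"fmt"
> 
> 	corev1 "k8s.io/api/core/v1"
> 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> 	"k8s.io/client-go/kubernetes"
> 	"k8s.io/client-go/tools/clientcmd"
> )
> 
> func main() {
> 	// Assumption: a kubeconfig at the default path; in-cluster config
> 	// would work equally well.
> 	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
> 	if err != nil {
> 		panic(err)
> 	}
> 	client := kubernetes.NewForConfigOrDie(cfg)
> 
> 	// Same filter kubectl describe node applies for "Non-terminated Pods":
> 	// pods bound to the node whose phase is neither Succeeded nor Failed.
> 	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
> 		metav1.ListOptions{
> 			FieldSelector: "spec.nodeName=yk8s-worker,status.phase!=Succeeded,status.phase!=Failed",
> 		})
> 	if err != nil {
> 		panic(err)
> 	}
> 
> 	var requested int64
> 	for _, p := range pods.Items {
> 		for _, c := range p.Spec.Containers {
> 			if mem, ok := c.Resources.Requests[corev1.ResourceMemory]; ok {
> 				requested += mem.Value()
> 			}
> 		}
> 	}
> 	fmt.Printf("memory requested on yk8s-worker: %d B\n", requested)
> }
> {code}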
> *YuniKorn Node REST API response:* (node: yk8s-worker)
>  * memory capacity: 14610305024 B
>  * memory occupied: *Empty*
>  * memory allocated: 9740000000 B (4870M + 4870M) (66%)
>  * memory available: 4870305024 B
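> For reference, the numbers above come from the scheduler's node endpoint. A 
> minimal sketch follows; the web service address and the "default" partition 
> are assumptions, and the struct models only the fields quoted here:
> {code:go}
> package main
> 
> import (
> 	"encoding/json"
> 	"fmt"
> 	"net/http"
> )
> 
> // Trimmed-down node DAO: only the resource maps compared above.
> type nodeInfo struct {
> 	NodeID    string           `json:"nodeID"`
> 	Capacity  map[string]int64 `json:"capacity"`
> 	Occupied  map[string]int64 `json:"occupied"`
> 	Allocated map[string]int64 `json:"allocated"`
> 	Available map[string]int64 `json:"available"`
> }
> 
> func main() {
> 	// Assumption: default scheduler web service address and partition.
> 	resp, err := http.Get("http://localhost:9080/ws/v1/partition/default/nodes")
> 	if err != nil {
> 		panic(err)
> 	}
> 	defer resp.Body.Close()
> 
> 	var nodes []nodeInfo
> 	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
> 		panic(err)
> 	}
> 	for _, n := range nodes {
> 		if n.NodeID != "yk8s-worker" {
> 			continue
> 		}
> 		fmt.Printf("capacity=%d occupied=%d allocated=%d available=%d (B)\n",
> 			n.Capacity["memory"], n.Occupied["memory"],
> 			n.Allocated["memory"], n.Available["memory"])
> 	}
> }
> {code}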
> From the K8S point of view the node only has ~30% of its memory remaining, 
> so sleepjob3 (4870M, 33%) never entered the Running state.
> YuniKorn is not aware of 'kindnet-tr9jc' and 
> 'yunikorn-admission-controller-56c8c8b766-55kz8' running on the node, so 
> the "YuniKorn node available resource" is not equal to the "K8S node 
> available resource".
> Removing the dynamic limit for the sleep jobs and setting a fixed pod/queue 
> limit could solve the issue in the e2e test (a sketch follows below), but we 
> still need to investigate why those running pods are invisible to YuniKorn.
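> A sketch of that fix direction (the function name is hypothetical, not the 
> actual patch): pin the sleep pods to a fixed request instead of deriving it 
> from YuniKorn's reported available memory:
> {code:go}
> package main
> 
> import (
> 	"fmt"
> 
> 	corev1 "k8s.io/api/core/v1"
> 	"k8s.io/apimachinery/pkg/api/resource"
> )
> 
> // sleepPodRequests is hypothetical; it stands in for wherever the e2e
> // test sizes its sleep pods. A constant request keeps the test valid
> // even when YuniKorn's "available" overstates what the kubelet can
> // actually admit (because some running pods are invisible to it).
> func sleepPodRequests() corev1.ResourceList {
> 	return corev1.ResourceList{
> 		corev1.ResourceMemory: resource.MustParse("500M"),
> 	}
> }
> 
> func main() {
> 	mem := sleepPodRequests()[corev1.ResourceMemory]
> 	fmt.Println("sleep pod memory request:", mem.String())
> }
> {code}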
>  


