[
https://issues.apache.org/jira/browse/YUNIKORN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu-Lin Chen updated YUNIKORN-2313:
----------------------------------
Description:
The failed e2e test "Verify_basic_preemption" attempted to request more
resources than were available on the K8S node. The root cause is that some
pods running on the node are invisible to the YuniKorn scheduler, so the e2e
test mistakenly assumed it could acquire all of the available resources
recorded by YuniKorn.
[https://github.com/apache/yunikorn-k8shim/actions/runs/7420644533/job/20212796775?pr=759#step:6:1483]
*Logs:* (Verify_basic_preemption)
{*}K8S describe node result{*}: (node: yk8s-worker)
allocatable resource: 14610305024 B
non-terminated pods' memory usage:
* sleepjob1 - 4870M (33%)
* sleepjob2 - 4870M (33%)
* kindnet-tr9jc - {color:#de350b}50Mi{color} (0%)
* kube-proxy-k58rn - 0 B
* yunikorn-admission-controller-56c8c8b766-55kz8 - {color:#de350b}500Mi{color} (3%)
→ Total 70%
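As a sanity check, the per-pod numbers above add up as follows (1Mi = 1048576 B):
{noformat}
sleepjob1 + sleepjob2        =  9740000000 B
kindnet-tr9jc (50Mi)         =    52428800 B
admission controller (500Mi) =   524288000 B
total                        = 10316716800 B  (~70.6% of 14610305024 B)
K8S remaining                =  4293588224 B  (~29.4%)
{noformat}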
{*}YuniKorn Node REST API response{*}: (node: yk8s-worker)
* memory capacity: 14610305024 B
* memory allocated: 9740000000 B (4870M + 4870M) (66%)
* memory available: 4870305024 B
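For context, these numbers come from the scheduler's REST API. Below is a
minimal Go sketch of that query, assuming the scheduler web service is
port-forwarded to localhost:9080 (the default web port) and the default
partition is used; the struct mirrors only the node fields referenced above:
{code:go}
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// nodeInfo mirrors the subset of the node object returned by the
// scheduler's GET /ws/v1/partition/{partition}/nodes endpoint.
type nodeInfo struct {
	NodeID    string           `json:"nodeID"`
	Capacity  map[string]int64 `json:"capacity"`
	Allocated map[string]int64 `json:"allocated"`
	Available map[string]int64 `json:"available"`
}

func main() {
	// Assumption: the scheduler web service is reachable on localhost:9080.
	resp, err := http.Get("http://localhost:9080/ws/v1/partition/default/nodes")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var nodes []nodeInfo
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		panic(err)
	}
	for _, n := range nodes {
		capMem, allocMem := n.Capacity["memory"], n.Allocated["memory"]
		fmt.Printf("%s: capacity=%d allocated=%d (%.0f%%) available=%d\n",
			n.NodeID, capMem, allocMem,
			float64(allocMem)/float64(capMem)*100, n.Available["memory"])
	}
}
{code}
For yk8s-worker this reports only the two sleep jobs as allocated; the kindnet
and admission-controller pods do not show up in the allocated figure.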
From the K8S point of view the node has only ~30% of its memory left, so
sleepjob3 (4870M, 33%) never entered the Running state.
YuniKorn is not aware of 'kindnet-tr9jc' and
'yunikorn-admission-controller-56c8c8b766-55kz8' running on the node. Those
two pods account for 50Mi + 500Mi = 576716800 B, which is exactly the gap
between the two views: the “YuniKorn node available resource” (4870305024 B)
does not equal the “K8S node available resource” (4293588224 B), and
sleepjob3's request of 4870000000 B fits within the former but not the latter.
Removing the dynamic limit for the sleep jobs and setting a fixed pod/queue
limit could solve the issue in the e2e test (see the config sketch below), but
we still need to investigate why those running pods are invisible to YuniKorn.
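For illustration only, a fixed queue limit in the queues.yaml section of the
yunikorn-configs ConfigMap could look like the sketch below. The queue name
root.sandbox is hypothetical (the queue actually used by the preemption test
may differ), and memory is given as a Kubernetes-style quantity:
{code:yaml}
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: sandbox            # hypothetical queue for the sleep jobs
            resources:
              max:
                # Fixed cap, instead of a limit derived at runtime from the
                # node's "available" value reported by YuniKorn.
                memory: 4870M
{code}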
> Flaky E2E Test: "Verify_basic_preemption" tries to request more resources than
> the node’s available resources
> ----------------------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2313
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2313
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: test - e2e
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Attachments: 2_e2e-tests (v1.28.0) (Verify_basic_preemption).txt
>
>
> The failed e2e test "Verify_basic_preemption" attempted to request more
> resources than were available on the K8S node. The root cause is that some
> pods running on the node are invisible to the YuniKorn scheduler, so the e2e
> test mistakenly assumed it could acquire all of the available resources
> recorded by YuniKorn.
> [https://github.com/apache/yunikorn-k8shim/actions/runs/7420644533/job/20212796775?pr=759#step:6:1483]
>
> *Logs:* (Verify_basic_preemption)
> {*}K8S describe node result{*}: (node: yk8s-worker)
> allocatable resource: 14610305024 B
> non-terminated pods' memory usage:
> * sleepjob1 - 4870M (33%)
> * sleepjob2 - 4870M (33%)
> * kindnet-tr9jc - {color:#de350b}50Mi{color} (0%)
> * kube-proxy-k58rn - 0 B
> * yunikorn-admission-controller-56c8c8b766-55kz8 - {color:#de350b}500Mi{color} (3%)
> → Total 70%
> {*}YuniKorn Node REST API response{*}: (node: yk8s-worker)
> * memory capacity: 14610305024 B
> * memory allocated: 9740000000 B (4870M + 4870M) (66%)
> * memory available: 4870305024 B
> From the K8S point of view the node has only ~30% of its memory left, so
> sleepjob3 (4870M, 33%) never entered the Running state.
> YuniKorn is not aware of 'kindnet-tr9jc' and
> 'yunikorn-admission-controller-56c8c8b766-55kz8' running on the node, so the
> “YuniKorn node available resource” does not equal the “K8S node available
> resource”.
> Removing the dynamic limit for the sleep jobs and setting a fixed pod/queue
> limit could solve the issue in the e2e test, but we still need to investigate
> why those running pods are invisible to YuniKorn.
>