[ 
https://issues.apache.org/jira/browse/YUNIKORN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779180#comment-17779180
 ] 

Yu-Lin Chen edited comment on YUNIKORN-2070 at 10/24/23 5:08 PM:
-----------------------------------------------------------------

Actually, I cannot reproduce this issue in my environment, so I'm not sure 
whether increasing the limit to 20MB will solve it. (The real memory overhead 
might depend on device loading or the containerd version.)

Could I create another PR that keeps only the gang scheduling e2e test, so we 
can do several trial runs with minimal impact?



> E2e tests for gang_scheduling failed due to containers init were OOM-Killed
> ---------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2070
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2070
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>              Labels: pull-request-available
>
> Recently we encountered several gang scheduling failures in the CI e2e tests. 
> All of the failures occurred while waiting for the creation of placeholders 
> (with a 10M memory limit). Some placeholders failed with the OOM-killed error below:
> {code:java}
> “Error: failed to create containerd task: failed to create shim task: OCI 
> runtime create failed: runc create failed: unable to start container process: 
> container init was OOM-killed (memory limit too low?): unknown” {code}
> The root cause might be a varying memory peak when the OCI runtime creates 
> multiple containers. We can try changing the placeholder memory limit from 10M 
> to 20M in the e2e test. (Sleep jobs use 20M of memory.)
> Some failed e2e test runs from the last 3 weeks:
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:934]) Target 15 placeholders, 14 created, 1 OOM-Killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:1047]) Target 3 placeholders, 2 created, 1 OOM-Killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:1033]) Target 3 placeholders, 2 created, 1 OOM-Killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:932]) Target 15 placeholders, 11 created, 4 OOM-Killed.
>  
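
For illustration only, here is a minimal Go sketch of how the proposed 20M placeholder size might be expressed. It assumes the placeholder's memory comes from the task group's minResource in the yunikorn.apache.org/task-groups pod annotation. The TaskGroup struct, the placeholderTaskGroups helper, and the group name "group-a" are hypothetical and are not taken from the e2e test code discussed in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskGroup is a hypothetical, minimal shape of one task group entry.
// Only the fields needed for this sketch are included.
type TaskGroup struct {
	Name        string            `json:"name"`
	MinMember   int32             `json:"minMember"`
	MinResource map[string]string `json:"minResource"`
}

// placeholderTaskGroups builds the JSON value for a task-groups annotation
// with a single group whose placeholders request the given memory.
func placeholderTaskGroups(name string, minMember int32, memory string) (string, error) {
	groups := []TaskGroup{{
		Name:        name,
		MinMember:   minMember,
		MinResource: map[string]string{"memory": memory},
	}}
	b, err := json.Marshal(groups)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Assumption: 20M replaces the 10M value that was OOM-killed in CI.
	value, err := placeholderTaskGroups("group-a", 3, "20M")
	if err != nil {
		panic(err)
	}
	fmt.Println(value)
}
```

Keeping the memory value as a parameter would make it easy to try several sizes across trial runs without touching the rest of the test.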


