[ 
https://issues.apache.org/jira/browse/YUNIKORN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen updated YUNIKORN-2070:
----------------------------------
    Description: 
Recently we encountered several gang scheduling failures in the CI e2e tests. 
In every case the test was waiting for placeholders (with a 10M memory limit) 
to be created, but some placeholders failed with the OOM-killed error below:
{code:java}
“Error: failed to create containerd task: failed to create shim task: OCI 
runtime create failed: runc create failed: unable to start container process: 
container init was OOM-killed (memory limit too low?): unknown” {code}
The root cause might be that peak memory usage varies when the OCI runtime 
creates multiple containers. We can try raising the placeholder memory limit 
from 10M to 20M in the e2e tests. (Sleep jobs already use 20M of memory.)
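
As a rough illustration of the proposed change, the sketch below builds a task group definition in the shape used by the yunikorn.apache.org/task-groups pod annotation, with the placeholder memory raised from 10M to 20M. The helper names, the group name "group-a", the member count, and the CPU value are assumptions made for illustration; the actual e2e tests set these values in their own Go code, which is not reproduced here.

```python
import json

# Annotation key read by YuniKorn for gang scheduling task groups.
TASK_GROUPS_KEY = "yunikorn.apache.org/task-groups"


def task_group(name, min_member, memory, cpu="100m"):
    """Hypothetical helper: one task group entry (name, minMember, minResource)."""
    return {
        "name": name,
        "minMember": min_member,
        "minResource": {"cpu": cpu, "memory": memory},
    }


def task_groups_annotation(groups):
    """Hypothetical helper: serialize task groups into the annotation value."""
    return {TASK_GROUPS_KEY: json.dumps(groups)}


# Old placeholder limit (10M) versus the proposed one (20M).
# "group-a" and minMember=3 are illustrative values only.
before = task_group("group-a", 3, "10M")
after = task_group("group-a", 3, "20M")

annotation = task_groups_annotation([after])
print(annotation[TASK_GROUPS_KEY])
```

Only the memory value differs between the two definitions, so the change stays limited to the placeholder size and leaves the gang size untouched.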

Some of the e2e test runs that failed in the last 3 weeks:
 # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:2452])
 Target 15 placeholders: 14 created, 1 OOM-killed.
 # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:2315])
 Target 3 placeholders: 2 created, 1 OOM-killed.
 # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:2510])
 Target 3 placeholders: 2 created, 1 OOM-killed.
 # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:2798])
 Target 15 placeholders: 11 created, 4 OOM-killed.
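
The four runs listed above can be tallied with a small sketch like the following. The numbers come straight from the list; the variable and function names are illustrative only.

```python
# (targeted placeholders, OOM-killed placeholders) for each listed run.
RUNS = [(15, 1), (3, 1), (3, 1), (15, 4)]


def summarize(runs):
    """Return total targeted, total OOM-killed, and the OOM-killed fraction."""
    targeted = sum(t for t, _ in runs)
    killed = sum(k for _, k in runs)
    return targeted, killed, killed / targeted


targeted, killed, fraction = summarize(RUNS)
print(f"{killed} of {targeted} placeholders were OOM-killed ({fraction:.1%})")
```

Across these runs, 7 of the 36 targeted placeholders were OOM-killed.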
 

  was:
Recently we encountered several gang scheduling errors in CI e2e test, all of 
the failures are waiting for the creation of placeholders(with 10M memory 
limit). However, some placeholders are failed with below OOM-killed error:
{code:java}
“Error: failed to create containerd task: failed to create shim task: OCI 
runtime create failed: runc create failed: unable to start container process: 
container init was OOM-killed (memory limit too low?): unknown” {code}
The root cause might be the varying memory peak when OCI runtime create 
multiple containers. We can try to change placeholder memory limit from 10M to 
20M in e2e test. (Sleep jobs are using 20M memory.)

List some failed e2e test in last 3 weeks:
 # 
([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:934])
 Target 15 placeholder, 14 created 1 OOM-Killed.
 # 
([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:1047])
 Target 3 placeholder, 2 created 1 OOM-Killed.
 # 
([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:1033])
 Target 3 placeholder, 2 created 1 OOM-Killed.
 # 
([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:932])
 Target 15 placeholder, 11 created 4 OOM-Killed.

 


> E2e tests for gang_scheduling failed due to containers init were OOM-Killed
> ---------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2070
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2070
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>              Labels: pull-request-available
>
> Recently we encountered several gang scheduling failures in the CI e2e tests. 
> In every case the test was waiting for placeholders (with a 10M memory limit) 
> to be created, but some placeholders failed with the OOM-killed error below:
> {code:java}
> “Error: failed to create containerd task: failed to create shim task: OCI 
> runtime create failed: runc create failed: unable to start container process: 
> container init was OOM-killed (memory limit too low?): unknown” {code}
> The root cause might be that peak memory usage varies when the OCI runtime 
> creates multiple containers. We can try raising the placeholder memory limit 
> from 10M to 20M in the e2e tests. (Sleep jobs already use 20M of memory.)
> Some of the e2e test runs that failed in the last 3 weeks:
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:2452])
>  Target 15 placeholders: 14 created, 1 OOM-killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:2315])
>  Target 3 placeholders: 2 created, 1 OOM-killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:2510])
>  Target 3 placeholders: 2 created, 1 OOM-killed.
>  # ([Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:2798])
>  Target 15 placeholders: 11 created, 4 OOM-killed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
