Yu-Lin Chen created YUNIKORN-2070:
-------------------------------------

             Summary: E2e tests for gang_scheduling failed because container 
init was OOM-killed
                 Key: YUNIKORN-2070
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2070
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: test - e2e
            Reporter: Yu-Lin Chen
            Assignee: Yu-Lin Chen


Recently we have seen several gang scheduling failures in the CI e2e tests. In 
every case the test was waiting for placeholders (with a 10M memory limit) to 
be created, but some placeholders failed with the OOM-killed error below:
{code:java}
“Error: failed to create containerd task: failed to create shim task: OCI 
runtime create failed: runc create failed: unable to start container process: 
container init was OOM-killed (memory limit too low?): unknown” {code}
The root cause might be that peak memory usage varies when the OCI runtime 
creates multiple containers. We can try raising the placeholder memory limit 
from 10M to 20M in the e2e tests. (The sleep jobs already use 20M of memory.)
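
As a rough illustration of the proposed change, the sketch below builds a task-groups annotation value whose placeholder memory request is 20M instead of 10M. It assumes the placeholder size comes from the minResource field of the yunikorn.apache.org/task-groups pod annotation; the TaskGroup type, the buildTaskGroupsAnnotation helper, and the group name are hypothetical and are not taken from the e2e test code, which may set the limit differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskGroup is a hypothetical local type that mirrors a subset of the
// fields in the yunikorn.apache.org/task-groups annotation.
type TaskGroup struct {
	Name        string            `json:"name"`
	MinMember   int               `json:"minMember"`
	MinResource map[string]string `json:"minResource"`
}

// placeholderMemory is the proposed value; the failing runs used "10M".
const placeholderMemory = "20M"

// buildTaskGroupsAnnotation returns the JSON string for a single task
// group with the given member count and memory request.
func buildTaskGroupsAnnotation(name string, minMember int, memory string) (string, error) {
	groups := []TaskGroup{{
		Name:        name,
		MinMember:   minMember,
		MinResource: map[string]string{"memory": memory},
	}}
	raw, err := json.Marshal(groups)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// "group-a" and 3 members are illustrative values only.
	value, err := buildTaskGroupsAnnotation("group-a", 3, placeholderMemory)
	if err != nil {
		panic(err)
	}
	fmt.Println(value)
}
```

Keeping the value in one constant means the limit can be adjusted again in a single place if 20M also turns out to be too low.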

Some e2e test runs that failed in the last 3 weeks:
 # [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:934]: target 15 placeholders, 14 created, 1 OOM-killed.
 # [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:1047]: target 3 placeholders, 2 created, 1 OOM-killed.
 # [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:1033]: target 3 placeholders, 2 created, 1 OOM-killed.
 # [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:932]: target 15 placeholders, 11 created, 4 OOM-killed.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
