Yu-Lin Chen created YUNIKORN-2070:
-------------------------------------
Summary: E2e tests for gang_scheduling failed because container
init was OOM-killed
Key: YUNIKORN-2070
URL: https://issues.apache.org/jira/browse/YUNIKORN-2070
Project: Apache YuniKorn
Issue Type: Bug
Components: test - e2e
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
Recently we hit several gang scheduling failures in the CI e2e tests. In every
case the test was waiting for placeholders (with a 10M memory limit) to be
created, but some placeholders failed with the OOM-killed error below:
{code:java}
“Error: failed to create containerd task: failed to create shim task: OCI
runtime create failed: runc create failed: unable to start container process:
container init was OOM-killed (memory limit too low?): unknown” {code}
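A minimal Go sketch of how an e2e helper could recognize this failure from a container status message. The helper name isInitOOMKilled and the constant initOOMMarker are hypothetical and not part of yunikorn-k8shim; the only thing taken from this issue is the error text itself.

```go
package main

import (
	"fmt"
	"strings"
)

// initOOMMarker is the distinctive substring of the runc error quoted in this issue.
// The constant name is an assumption made for this sketch.
const initOOMMarker = "container init was OOM-killed"

// isInitOOMKilled (hypothetical helper) reports whether a container status
// message says the container's init process was OOM-killed during creation.
func isInitOOMKilled(message string) bool {
	return strings.Contains(message, initOOMMarker)
}

func main() {
	msg := "Error: failed to create containerd task: failed to create shim task: " +
		"OCI runtime create failed: runc create failed: unable to start container process: " +
		"container init was OOM-killed (memory limit too low?): unknown"

	fmt.Println(isInitOOMKilled(msg))                // the error from this issue
	fmt.Println(isInitOOMKilled("ImagePullBackOff")) // an unrelated failure
}
```

A check like this would let a test report "placeholder OOM-killed" explicitly rather than only timing out while waiting for placeholders.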
The root cause might be that peak memory usage varies when the OCI runtime
creates multiple containers. We can try raising the placeholder memory limit
from 10M to 20M in the e2e tests. (Sleep jobs use 20M of memory.)
Some failed e2e test runs from the last 3 weeks:
# [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6604772421/job/17945394693#step:5:934]
Target 15 placeholders: 14 created, 1 OOM-killed.
# [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6596361827/job/17922430982#step:5:1047]
Target 3 placeholders: 2 created, 1 OOM-killed.
# [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6408692237/job/17436748282#step:5:1033]
Target 3 placeholders: 2 created, 1 OOM-killed.
# [Link|https://github.com/apache/yunikorn-k8shim/actions/runs/6545212501/job/17773871963#step:5:932]
Target 15 placeholders: 11 created, 4 OOM-killed.
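For a rough sense of scale, the four runs listed above can be tallied with a short Go sketch. The type and function names (run, tally) are made up for illustration; the numbers are copied from the list, and four runs are too few to read the resulting percentage as a reliable failure rate.

```go
package main

import "fmt"

// run holds the placeholder counts reported for one failed CI run.
type run struct {
	target    int // placeholders the test expected
	created   int // placeholders that were created
	oomKilled int // placeholders whose container init was OOM-killed
}

// tally sums the counts across all runs.
func tally(runs []run) (target, created, oomKilled int) {
	for _, r := range runs {
		target += r.target
		created += r.created
		oomKilled += r.oomKilled
	}
	return target, created, oomKilled
}

func main() {
	// Counts taken from the four runs listed in this issue.
	runs := []run{
		{target: 15, created: 14, oomKilled: 1},
		{target: 3, created: 2, oomKilled: 1},
		{target: 3, created: 2, oomKilled: 1},
		{target: 15, created: 11, oomKilled: 4},
	}

	target, created, oomKilled := tally(runs)
	fmt.Printf("target=%d created=%d oomKilled=%d (%.1f%%)\n",
		target, created, oomKilled, 100*float64(oomKilled)/float64(target))
	// prints: target=36 created=29 oomKilled=7 (19.4%)
}
```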