[
https://issues.apache.org/jira/browse/YUNIKORN-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-2465:
--------------------------------------------
Target Version: 1.6.0
Issue Type: Bug (was: Task)
Priority: Critical (was: Major)
I have moved this to a bug: based on the description we run out of memory
(OOM), which should never happen, so we should treat this as a bug. I also
raised the priority to Critical and, for now, targeted it for 1.6.0.
Rationale behind this:
Even for shorter running large applications I can see this as an issue.
Applications often have a longer running tail with a small number of pods still
around. If such an application has used a large number of pods over its
lifetime (for example 50,000 in total), their tasks stick around. This could be
worse in large clusters with multiple such applications running, partly in
parallel, partly sequentially without peak usage overlap. In those cases we
could be tracking several times more inactive tasks than active ones.
We do not need the active tasks...
> Remove Task objects from the shim upon pod completion
> -----------------------------------------------------
>
> Key: YUNIKORN-2465
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2465
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Critical
>
> We don't remove Task objects from the shim when the pod completes. This has
> consequences for long-running workloads which keep generating new pods with
> the same applicationID, such as Spark Streaming. The ever-increasing memory
> usage eventually results in an OOM and the termination of YuniKorn. Tasks are
> only removed when the application reaches the Completed state in the
> scheduler-core.
> A restart fixes the situation because completed pods are not restored and
> added to the Context/Application. We should remove the tasks during the
> lifetime of the application unless there's a good reason not to.
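
To illustrate the proposed behaviour, here is a minimal sketch in Go. It is not
the shim's actual code: the Application type, its tasks map, and the AddTask,
OnPodCompleted and TaskCount methods are all hypothetical names made up for this
example. The only point it makes is that dropping the task entry when the pod
completes keeps the tracked set proportional to the number of active pods
rather than to every pod the application has ever run.

```go
package main

import "fmt"

// Task is a hypothetical, simplified stand-in for a per-pod task object.
type Task struct {
	ID string
}

// Application is a hypothetical stand-in that tracks its tasks by ID.
type Application struct {
	ID    string
	tasks map[string]*Task
}

// NewApplication creates an application with an empty task map.
func NewApplication(id string) *Application {
	return &Application{ID: id, tasks: make(map[string]*Task)}
}

// AddTask registers a task when its pod is first seen.
func (a *Application) AddTask(taskID string) {
	a.tasks[taskID] = &Task{ID: taskID}
}

// OnPodCompleted removes the task as soon as its pod finishes, instead of
// waiting for the whole application to complete. It reports whether a task
// was actually removed.
func (a *Application) OnPodCompleted(taskID string) bool {
	if _, ok := a.tasks[taskID]; !ok {
		return false
	}
	delete(a.tasks, taskID)
	return true
}

// TaskCount returns the number of tasks still tracked.
func (a *Application) TaskCount() int {
	return len(a.tasks)
}

func main() {
	// Assumed scenario: 50,000 pods over the lifetime, 10 still running.
	app := NewApplication("spark-streaming-1")
	for i := 0; i < 50000; i++ {
		id := fmt.Sprintf("pod-%d", i)
		app.AddTask(id)
		if i < 49990 {
			app.OnPodCompleted(id)
		}
	}
	fmt.Println(app.TaskCount()) // prints 10
}
```

Without the delete call in OnPodCompleted, the same loop would leave all 50,000
entries in the map until the application itself is removed.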