[ 
https://issues.apache.org/jira/browse/YUNIKORN-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2465:
--------------------------------------------
    Target Version: 1.6.0
        Issue Type: Bug  (was: Task)
          Priority: Critical  (was: Major)

I have changed this to a bug: based on the description we run out of memory 
(OOM), which should never happen, so we should treat this as a bug. I also 
raised the priority to critical and, for now, targeted it for 1.6.0.

Rationale behind this: 

I can see this being an issue even for shorter-running large applications. 
Applications often have a long-running tail with only a small number of pods 
still around. If that application has used a large number of pods over its 
lifetime (for example 50,000 in total), the tasks for all of them stick around. 
This could be worse in a large cluster with several of these applications 
running, partly in parallel and partly sequentially, without their peak usage 
overlapping. In those cases we could be tracking several times more inactive 
tasks than active ones.

We do not need the active tasks...

> Remove Task objects from the shim upon pod completion
> -----------------------------------------------------
>
>                 Key: YUNIKORN-2465
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2465
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Critical
>
> We don't remove Task objects from the shim when the pod completes. This has 
> consequences for long-running workloads which keep generating new pods with 
> the same applicationID, such as Spark Streaming. The ever-increasing memory 
> usage eventually results in an OOM and the termination of YuniKorn. Tasks are 
> only removed when the application reaches the Completed state in the 
> scheduler-core.
> A restart fixes the situation because completed pods are not restored and 
> added to the Context/Application. We should remove the tasks during the 
> lifetime of the application unless there's a good reason not to.
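
To illustrate the proposed behaviour, here is a minimal Go sketch of an 
application that drops a task as soon as its pod completes. The types and 
methods (Task, Application, NewApplication, AddTask, RemoveTask, TaskCount) 
are hypothetical stand-ins invented for this example, not the actual shim 
code; the figure of 50,000 pods is the example from the comment above, and 
the tail of 10 pods is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical stand-in for the shim's per-pod Task object.
type Task struct {
	ID string
}

// Application is a hypothetical, simplified application that tracks
// its tasks in a map guarded by a read/write lock.
type Application struct {
	mu    sync.RWMutex
	ID    string
	tasks map[string]*Task
}

// NewApplication creates an application with an empty task map.
func NewApplication(id string) *Application {
	return &Application{ID: id, tasks: make(map[string]*Task)}
}

// AddTask starts tracking a task for a newly seen pod.
func (a *Application) AddTask(taskID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.tasks[taskID] = &Task{ID: taskID}
}

// RemoveTask stops tracking a task, for example when its pod completes.
// It reports whether the task was tracked before the call.
func (a *Application) RemoveTask(taskID string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	_, ok := a.tasks[taskID]
	delete(a.tasks, taskID)
	return ok
}

// TaskCount returns the number of tasks currently tracked.
func (a *Application) TaskCount() int {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return len(a.tasks)
}

func main() {
	app := NewApplication("spark-streaming-1")
	const total, tail = 50000, 10

	// The application creates many pods over its lifetime.
	for i := 0; i < total; i++ {
		app.AddTask(fmt.Sprintf("pod-%d", i))
	}
	// All but a small tail of pods complete; their tasks are removed.
	for i := 0; i < total-tail; i++ {
		app.RemoveTask(fmt.Sprintf("pod-%d", i))
	}
	fmt.Println("tasks still tracked:", app.TaskCount())
}
```

With removal on completion, the sketch ends up tracking only the 10 pods in 
the tail; without the RemoveTask calls all 50,000 tasks would stay in memory 
until the application itself completes.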



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
