[jira] [Comment Edited] (FLINK-16225) Metaspace Out Of Memory should be handled as Fatal Error in TaskManager

Andrey Zagrebin (Jira) Mon, 15 Jun 2020 07:34:25 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135918#comment-17135918
 ]


Andrey Zagrebin edited comment on FLINK-16225 at 6/15/20, 2:33 PM:
-------------------------------------------------------------------

merge into master by bac55c175b0d3a76395880d7aa2e5ae6484364fa
merge into 1.11 by 435ec274129dbef84b4d93526f07d2e2c6332585
decided not to merge into 1.10, see details 
[here|https://github.com/apache/flink/pull/12563#issuecomment-644170325]


was (Author: azagrebin):
merge into master by bac55c175b0d3a76395880d7aa2e5ae6484364fa
merge into 1.11 by 435ec274129dbef84b4d93526f07d2e2c6332585
decided not to merge into 1.10, see details 
[here|[https://github.com/apache/flink/pull/12563#issuecomment-644170325]

> Metaspace Out Of Memory should be handled as Fatal Error in TaskManager
> -----------------------------------------------------------------------
>
>                 Key: FLINK-16225
>                 URL: https://issues.apache.org/jira/browse/FLINK-16225
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>            Reporter: Stephan Ewen
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>              Labels: pull-request-available, usability
>             Fix For: 1.11.0
>
>
> When an {{OutOfMemory (Metaspace)}} exception happens, there is usually no 
> way to recover. This is often the result of user code or libraries that have 
> subtle class loading leaks.
> The one way to recover is to kill the TaskManagers and to let the resource 
> orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should 
> then be able to recover the job.
> I would suggest to implement this the following way:
> * The user code ClassLoader takes an "OOM Handler", which is called when 
> class loading causes an OOM exception.
> * The handler wraps this into an Exception with a good error message (see 
> below) and invokes the TaskManager's {{FatalErrorHandler}}.
> * The {{FatalErrorHandler}} in turn should attempt to cancel everything and 
> notify the JM before shutting down. That way, we get decent error reporting 
> and users can see what is going on.
> The error message should describe the following:
> * If user sees the error consistently on the first deploy, then the metaspace 
> is simply too small for their application, and they need to explicitly 
> increase it in the configuration
> * If the user sees occasionally TaskManagers in a session cluster failing 
> with that exception when deploying new jobs, then some user code or library 
> probably has a class leak. The TM failure / restart is done in order to 
> forcefully clean up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-16225) Metaspace Out Of Memory should be handled as Fatal Error in TaskManager

Reply via email to