[ 
https://issues.apache.org/jira/browse/FLINK-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dian Fu updated FLINK-16225:
----------------------------
    Fix Version/s:     (was: 1.10.1)
                   1.10.2

> Metaspace Out Of Memory should be handled as Fatal Error in TaskManager
> -----------------------------------------------------------------------
>
>                 Key: FLINK-16225
>                 URL: https://issues.apache.org/jira/browse/FLINK-16225
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>            Reporter: Stephan Ewen
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>              Labels: usability
>             Fix For: 1.11.0, 1.10.2
>
>
> When an {{OutOfMemory (Metaspace)}} exception happens, there is usually no 
> way to recover. This is often the result of user code or libraries that have 
> subtle class loading leaks.
> The only way to recover is to kill the TaskManagers and let the resource 
> orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should 
> then be able to recover the job.
> I suggest implementing this as follows:
> * The user code ClassLoader takes an "OOM Handler", which is called when 
> class loading causes an OOM exception.
> * The handler wraps this into an Exception with a good error message (see 
> below) and invokes the TaskManager's {{FatalErrorHandler}}.
> * The {{FatalErrorHandler}} in turn should attempt to cancel everything and 
> notify the JM before shutting down. That way, we get decent error reporting 
> and users can see what is going on.
> The error message should describe the following:
> * If the user sees the error consistently on the first deploy, then the 
> metaspace is simply too small for their application, and they need to 
> explicitly increase it in the configuration.
> * If the user occasionally sees TaskManagers in a session cluster failing 
> with that exception when deploying new jobs, then some user code or library 
> probably has a class leak. The TM failure / restart is done in order to 
> forcefully clean up.
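
The proposal above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flink's actual implementation: the class name MetaspaceOomAwareClassLoader, the helpers isMetaspaceOom and withHint, and the use of a plain Consumer<Throwable> in place of the TaskManager's FatalErrorHandler are all hypothetical. It also assumes that the JVM reports the condition as an OutOfMemoryError whose message contains "Metaspace", as HotSpot does. The hint text names the taskmanager.memory.jvm-metaspace.size option from the 1.10 memory configuration.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names are Flink's actual API.
final class MetaspaceOomAwareClassLoader extends URLClassLoader {

    // Error message covering the two cases described in the issue.
    static final String HINT =
            "Metaspace out-of-memory error during class loading. "
                    + "If this happens consistently on the first deployment, the metaspace "
                    + "is too small for the application; increase "
                    + "'taskmanager.memory.jvm-metaspace.size'. "
                    + "If it happens occasionally when deploying new jobs to a session "
                    + "cluster, user code or a library probably leaks classes; the "
                    + "TaskManager is shut down to forcefully clean up.";

    // Stand-in for the TaskManager's FatalErrorHandler (assumption).
    private final Consumer<Throwable> fatalErrorHandler;

    MetaspaceOomAwareClassLoader(
            URL[] urls, ClassLoader parent, Consumer<Throwable> fatalErrorHandler) {
        super(urls, parent);
        this.fatalErrorHandler = fatalErrorHandler;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        try {
            return super.loadClass(name, resolve);
        } catch (OutOfMemoryError oom) {
            if (isMetaspaceOom(oom)) {
                // Report the enriched error, then rethrow the original.
                fatalErrorHandler.accept(withHint(oom));
            }
            throw oom;
        }
    }

    // Assumes the HotSpot convention of "Metaspace" in the error message.
    static boolean isMetaspaceOom(OutOfMemoryError oom) {
        String message = oom.getMessage();
        return message != null && message.contains("Metaspace");
    }

    // Wraps the original error so the stack trace is preserved as the cause.
    static RuntimeException withHint(OutOfMemoryError oom) {
        return new RuntimeException(HINT, oom);
    }
}
```

In this sketch the handler only receives the wrapped error; cancelling tasks and notifying the JM before shutdown would remain the responsibility of whatever is passed in as the handler.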



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
