[
https://issues.apache.org/jira/browse/FLINK-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrey Zagrebin updated FLINK-16225:
------------------------------------
Fix Version/s: (was: 1.10.1)
> Metaspace Out Of Memory should be handled as Fatal Error in TaskManager
> -----------------------------------------------------------------------
>
> Key: FLINK-16225
> URL: https://issues.apache.org/jira/browse/FLINK-16225
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Reporter: Stephan Ewen
> Assignee: Andrey Zagrebin
> Priority: Critical
> Labels: usability
> Fix For: 1.11.0
>
>
> When an {{OutOfMemory (Metaspace)}} exception happens, there is usually no
> way to recover. This is often the result of user code or libraries that have
> subtle class loading leaks.
> The one way to recover is to kill the TaskManagers and to let the resource
> orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should
> then be able to recover the job.
> I would suggest to implement this the following way:
> * The user code ClassLoader takes an "OOM Handler", which is called when
> class loading causes an OOM exception.
> * The handler wraps this into an Exception with a good error message (see
> below) and invokes the TaskManager's {{FatalErrorHandler}}.
> * The {{FatalErrorHandler}} in turn should attempt to cancel everything and
> notify the JM before shutting down. That way, we get decent error reporting
> and users can see what is going on.
> The error message should describe the following:
> * If user sees the error consistently on the first deploy, then the metaspace
> is simply too small for their application, and they need to explicitly
> increase it in the configuration
> * If the user sees occasionally TaskManagers in a session cluster failing
> with that exception when deploying new jobs, then some user code or library
> probably has a class leak. The TM failure / restart is done in order to
> forcefully clean up.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)