[
https://issues.apache.org/jira/browse/FLINK-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135918#comment-17135918
]
Andrey Zagrebin edited comment on FLINK-16225 at 6/15/20, 2:33 PM:
-------------------------------------------------------------------
merged into master by bac55c175b0d3a76395880d7aa2e5ae6484364fa
merged into 1.11 by 435ec274129dbef84b4d93526f07d2e2c6332585
decided not to merge into 1.10, see details
[here|https://github.com/apache/flink/pull/12563#issuecomment-644170325]
> Metaspace Out Of Memory should be handled as Fatal Error in TaskManager
> -----------------------------------------------------------------------
>
> Key: FLINK-16225
> URL: https://issues.apache.org/jira/browse/FLINK-16225
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Reporter: Stephan Ewen
> Assignee: Andrey Zagrebin
> Priority: Critical
> Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>
> When an {{OutOfMemoryError}} (Metaspace) happens, there is usually no
> way to recover. This is often the result of user code or libraries that have
> subtle class loading leaks.
> The only way to recover is to kill the TaskManagers and to let the resource
> orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should
> then be able to recover the job.
> I would suggest implementing this in the following way:
> * The user code ClassLoader takes an "OOM Handler", which is called when
> class loading causes an OOM exception.
> * The handler wraps this into an Exception with a good error message (see
> below) and invokes the TaskManager's {{FatalErrorHandler}}.
> * The {{FatalErrorHandler}} in turn should attempt to cancel everything and
> notify the JM before shutting down. That way, we get decent error reporting
> and users can see what is going on.
> The error message should describe the following:
> * If the user sees the error consistently on the first deploy, then the
> metaspace is simply too small for their application, and they need to
> explicitly increase it in the configuration.
> * If the user occasionally sees TaskManagers in a session cluster failing
> with that exception when deploying new jobs, then some user code or library
> probably has a class leak. The TM failure / restart is done in order to
> forcefully clean up.
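
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was merged: the names OomAwareClassLoader and MetaspaceExhaustedException are hypothetical, the handler is modelled as a plain Consumer<Throwable> standing in for the TaskManager's FatalErrorHandler, and the check assumes a HotSpot-style JVM that reports the error with a message containing "Metaspace".

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Consumer;

/** Hypothetical exception carrying the user-facing explanation from the issue. */
class MetaspaceExhaustedException extends RuntimeException {
    static final String MESSAGE =
        "Metaspace is exhausted. If this happens consistently on the first deploy, "
            + "the metaspace is too small for the application; increase it in the "
            + "configuration (taskmanager.memory.jvm-metaspace.size). If it happens "
            + "occasionally in a session cluster when deploying new jobs, user code or "
            + "a library probably has a class loading leak; the TaskManager is shut "
            + "down to forcefully clean up.";

    MetaspaceExhaustedException(Throwable cause) {
        super(MESSAGE, cause);
    }
}

/** Hypothetical user code class loader that reports metaspace OOMs to a handler. */
class OomAwareClassLoader extends URLClassLoader {

    // Stand-in for the fatal error handler; an assumption of this sketch.
    private final Consumer<Throwable> oomHandler;

    OomAwareClassLoader(URL[] urls, ClassLoader parent, Consumer<Throwable> oomHandler) {
        super(urls, parent);
        this.oomHandler = oomHandler;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        try {
            return super.loadClass(name, resolve);
        } catch (OutOfMemoryError e) {
            if (isMetaspaceOom(e)) {
                // Wrap with a descriptive message and hand over to the handler.
                oomHandler.accept(new MetaspaceExhaustedException(e));
            }
            // Always rethrow so the caller still observes the original error.
            throw e;
        }
    }

    // Assumption: the JVM sets the error message to something containing "Metaspace".
    static boolean isMetaspaceOom(OutOfMemoryError e) {
        String message = e.getMessage();
        return message != null && message.contains("Metaspace");
    }
}
```

In this sketch the handler only receives the wrapped exception; cancelling tasks, notifying the JM and shutting down the process would be the responsibility of whatever is passed in as the handler.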