[ https://issues.apache.org/jira/browse/FLINK-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dian Fu updated FLINK-16225: ---------------------------- Fix Version/s: (was: 1.10.1) 1.10.2 > Metaspace Out Of Memory should be handled as Fatal Error in TaskManager > ----------------------------------------------------------------------- > > Key: FLINK-16225 > URL: https://issues.apache.org/jira/browse/FLINK-16225 > Project: Flink > Issue Type: Improvement > Components: Runtime / Task > Affects Versions: 1.10.0 > Reporter: Stephan Ewen > Assignee: Andrey Zagrebin > Priority: Critical > Labels: usability > Fix For: 1.11.0, 1.10.2 > > > When an {{OutOfMemory (Metaspace)}} exception happens, there is usually no > way to recover. This is often the result of user code or libraries that have > subtle class loading leaks. > The one way to recover is to kill the TaskManagers and to let the resource > orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should > then be able to recover the job. > I would suggest to implement this the following way: > * The user code ClassLoader takes an "OOM Handler", which is called when > class loading causes an OOM exception. > * The handler wraps this into an Exception with a good error message (see > below) and invokes the TaskManager's {{FatalErrorHandler}}. > * The {{FatalErrorHandler}} in turn should attempt to cancel everything and > notify the JM before shutting down. That way, we get decent error reporting > and users can see what is going on. > The error message should describe the following: > * If user sees the error consistently on the first deploy, then the metaspace > is simply too small for their application, and they need to explicitly > increase it in the configuration > * If the user sees occasionally TaskManagers in a session cluster failing > with that exception when deploying new jobs, then some user code or library > probably has a class leak. The TM failure / restart is done in order to > forcefully clean up. -- This message was sent by Atlassian Jira (v8.3.4#803005)