[ https://issues.apache.org/jira/browse/FLINK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437392#comment-17437392 ]
Till Rohrmann commented on FLINK-24401:
---------------------------------------
I think the assumption was that a Metaspace OOM mainly occurs while loading
user code (additional classes). That is why we thought most Flink-related
code would keep working: its classes had already been loaded. Clearly, this
does not hold. I'd be fine with failing hard on a Metaspace OOM. If we still
want to offer the old behaviour, we could make the exit behaviour
configurable, with failing hard as the default.
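
To make the proposal concrete, here is a minimal sketch of a fatal error
handler with a configurable exit behaviour and a catch-all fallback. It is
not Flink's implementation: the class, the Behaviour enum, the JvmTerminator
interface and the exit code are hypothetical names chosen for illustration,
and any fatal error other than a Metaspace OOM simply terminates.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: none of the names below are Flink APIs. */
class MetaspaceOomHandlerSketch {

    /** Hypothetical switch for the proposed configurable exit behaviour. */
    enum Behaviour { FAIL_HARD, GRACEFUL_SHUTDOWN }

    /** Hypothetical seam around process exit, e.g. code -> Runtime.getRuntime().halt(code). */
    interface JvmTerminator {
        void terminate(int exitCode);
    }

    static final int FAILURE_EXIT_CODE = 1; // assumed value, for illustration only

    private final Behaviour behaviour;
    private final JvmTerminator terminator;
    private final Runnable gracefulShutdown;

    MetaspaceOomHandlerSketch(
            Behaviour behaviour, JvmTerminator terminator, Runnable gracefulShutdown) {
        this.behaviour = behaviour;
        this.terminator = terminator;
        this.gracefulShutdown = gracefulShutdown;
    }

    /** HotSpot reports a Metaspace OOM as an OutOfMemoryError whose message names "Metaspace". */
    static boolean isMetaspaceOom(Throwable t) {
        return t instanceof OutOfMemoryError
                && t.getMessage() != null
                && t.getMessage().contains("Metaspace");
    }

    void onFatalError(Throwable error) {
        try {
            if (isMetaspaceOom(error) && behaviour == Behaviour.GRACEFUL_SHUTDOWN) {
                // Old behaviour: assume shutdown needs no further class loading.
                gracefulShutdown.run();
            } else {
                // Simplification of this sketch: every other case fails hard.
                terminator.terminate(FAILURE_EXIT_CODE);
            }
        } catch (Throwable t) {
            // Fallback: if handling itself fails (e.g. NoClassDefFoundError), still exit.
            terminator.terminate(FAILURE_EXIT_CODE);
        }
    }

    public static void main(String[] args) {
        List<Integer> exitCodes = new ArrayList<>();
        // Simulates a graceful shutdown that fails because a class cannot be loaded.
        Runnable brokenShutdown = () -> {
            throw new NoClassDefFoundError("simulated class loading failure");
        };
        MetaspaceOomHandlerSketch handler = new MetaspaceOomHandlerSketch(
                Behaviour.GRACEFUL_SHUTDOWN, code -> exitCodes.add(code), brokenShutdown);
        handler.onFatalError(new OutOfMemoryError("Metaspace"));
        System.out.println(exitCodes); // prints [1]
    }
}
```

Routing the exit through an interface keeps the decision logic testable
without halting the JVM. A real handler would plug in an actual process exit
and would probably also bound the graceful path with a timeout.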
> TM cannot exit after Metaspace OOM
> ----------------------------------
>
> Key: FLINK-24401
> URL: https://issues.apache.org/jira/browse/FLINK-24401
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.12.0, 1.13.0
> Reporter: future
> Priority: Major
> Fix For: 1.14.1, 1.13.4
>
> Attachments: image-2021-09-29-12-00-28-510.png,
> image-2021-09-29-12-00-44-812.png
>
>
> Hi masters, from the code and the log we can see that an ordinary OOM calls
> terminateJVM() directly, whereas a Metaspace OutOfMemoryError triggers a
> graceful shutdown. The code comment says: {{_it does not usually require more
> class loading to fail again with the Metaspace OutOfMemoryError_}}.
> However, after a Metaspace OutOfMemoryError we hit
> {{_java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner$Result_}}, which left
> the TM unable to exit: it kept retrying and kept failing with
> NoClassDefFoundError on class loading until we killed the TM manually.
> I would like to add a catch (Throwable) block to the onFatalError method and
> call terminateJVM() directly inside it. Is there any problem with this
> strategy?
>
> [code link|https://github.com/apache/flink/blob/4fe9f525a92319acc1e3434bebed601306f7a16f/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L312]
> picture:
>
> !image-2021-09-29-12-00-44-812.png|width=1337,height=692!
> !image-2021-09-29-12-00-28-510.png!
>
>