[
https://issues.apache.org/jira/browse/FLINK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430011#comment-17430011
]
Piotr Nowojski edited comment on FLINK-24401 at 10/19/21, 7:02 AM:
-------------------------------------------------------------------
[~fanrui], can you post the actual stack trace of the second OOM?
As the error was caused by
{{org.apache.flink.runtime.taskexecutor.TaskManagerRunner.Result}}, I wonder if
we should actually just treat all OOMs the same way, regardless of whether it's
Metaspace or not. [~trohrmann], what do you think? I'm asking as you were doing the
review of that change, and I don't know the motivation behind it. Was it just a
best-effort thing that we added, just because we thought we could? Or was it
addressing some problem that users were complaining about?
{quote}
In case of Metaspace OOM error, we try a graceful TM shutdown to notify JM
because it is not expected to require class loading for that and cause further
failures.
{quote}
After all, this assumption is clearly wrong.
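To make the two options discussed here concrete, below is a minimal sketch, not Flink's actual code. It treats every OutOfMemoryError the same way (immediate termination) and, for other errors, falls back to termination if the graceful path itself throws, for example with a NoClassDefFoundError. The class name {{FatalErrorSketch}}, the {{FAILURE_EXIT_CODE}} constant, and the injected {{gracefulShutdown}} and {{terminateJvm}} callbacks are hypothetical names chosen for illustration; only {{Runtime.getRuntime().halt(int)}} is a real JDK call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical stand-in for a fatal error handler; not the real TaskManagerRunner.
final class FatalErrorSketch {

    // Hypothetical exit code used for illustration only.
    static final int FAILURE_EXIT_CODE = 1;

    /**
     * Handles a fatal error.
     *
     * @param error            the fatal error that was reported
     * @param gracefulShutdown assumed callback that attempts an orderly shutdown
     * @param terminateJvm     assumed callback that ends the process,
     *                         e.g. Runtime.getRuntime()::halt in production
     */
    static void onFatalError(Throwable error, Runnable gracefulShutdown, IntConsumer terminateJvm) {
        try {
            if (error instanceof OutOfMemoryError) {
                // Treat all OOMs alike, Metaspace or not: do not rely on
                // further class loading being possible.
                terminateJvm.accept(FAILURE_EXIT_CODE);
                return;
            }
            gracefulShutdown.run();
        } catch (Throwable t) {
            // The graceful path failed (e.g. NoClassDefFoundError),
            // so terminate directly instead of retrying.
            terminateJvm.accept(FAILURE_EXIT_CODE);
        }
    }

    public static void main(String[] args) {
        // Record the exit code instead of halting, so the demo is safe to run.
        AtomicInteger exitCode = new AtomicInteger(-1);
        Runnable failingShutdown = () -> {
            throw new NoClassDefFoundError("simulated class loading failure");
        };
        onFatalError(new IllegalStateException("fatal"), failingShutdown, exitCode::set);
        System.out.println("terminate requested with code " + exitCode.get());
    }
}
```

Passing the terminator as a callback is only a device to keep the sketch testable without killing the JVM; a real handler would call its termination routine directly.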
> TM cannot exit after Metaspace OOM
> ----------------------------------
>
> Key: FLINK-24401
> URL: https://issues.apache.org/jira/browse/FLINK-24401
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.12.0, 1.13.0
> Reporter: future
> Priority: Major
> Fix For: 1.14.1, 1.13.4
>
> Attachments: image-2021-09-29-12-00-28-510.png,
> image-2021-09-29-12-00-44-812.png
>
>
> Hi masters, from the code and log, we can see that a regular OOM calls
> terminateJVM directly, but a Metaspace OutOfMemoryError triggers a graceful
> shutdown. The code comment mentions: {{_it does not usually require more class
> loading to fail again with the Metaspace OutOfMemoryError_}}.
> But we encountered the following after a Metaspace OutOfMemoryError:
> {{_java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner$Result_}}. This makes
> the TM unable to exit: it keeps retrying and keeps failing with
> NoClassDefFoundError on class loading until the TM is killed manually.
> I want to add a catch Throwable in the onFatalError method, and directly
> call terminateJVM() in the catch. Is there any problem with this strategy?
>
> [code link
> |https://github.com/apache/flink/blob/4fe9f525a92319acc1e3434bebed601306f7a16f/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L312]
> picture:
>
> !image-2021-09-29-12-00-44-812.png|width=1337,height=692!
> !image-2021-09-29-12-00-28-510.png!
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)