[
https://issues.apache.org/jira/browse/FLINK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489353#comment-17489353
]
Piotr Nowojski commented on FLINK-25566:
----------------------------------------
I see your point and I tend to agree that Flink could/should try to recover
from this kind of situation, or at the very least fail over. However, I'm not
sure what we could do. Detecting this issue from the outside may be impossible,
as error handling can fail while heartbeats are still working. I'm pretty
sure that placing
{code:java}
try {
    // ... error handling code ...
} catch (NoClassDefFoundError e) {
    System.exit(42);
}
{code}
somewhere inside this particular error handling code won't solve all of the
cases, as we can hit the same problem just one method call earlier, above
this try/catch.
As this is not a Flink-specific issue, I wonder how other projects approach
this kind of issue. Maybe we can force the JVM to load all Flink core
classes first during startup?
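To make the preloading idea concrete, here is a minimal sketch. It is not existing Flink code: the class name ClassPreloader and the method preload are hypothetical, and the list of classes to load would have to be supplied by the caller. It only relies on the standard Class.forName(String, boolean, ClassLoader) call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Flink): eagerly loads classes at startup. */
public final class ClassPreloader {

    private ClassPreloader() {}

    /**
     * Tries to load and initialize every named class with the given loader.
     * Returns the names that could not be loaded, so the caller can decide
     * whether to abort startup.
     */
    public static List<String> preload(ClassLoader loader, String... classNames) {
        List<String> failed = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=true also runs static initializers now,
                // instead of on first use inside an error path.
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError and
                // ExceptionInInitializerError.
                failed.add(name);
            }
        }
        return failed;
    }
}
```

For example, calling ClassPreloader.preload(loader, "org.apache.flink.util.ExceptionUtils") during startup would load the class from this ticket before any error path needs it. This only protects the classes that are explicitly listed; anything else that is loaded lazily later can still fail if the disk goes bad.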
> Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError
> ---------------------------------------------------------------------
>
> Key: FLINK-25566
> URL: https://issues.apache.org/jira/browse/FLINK-25566
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Reporter: Liu
> Priority: Major
> Attachments: image-2022-01-07-19-07-10-968.png,
> image-2022-01-07-19-08-49-038.png, image-2022-01-07-19-11-39-448.png,
> image-2022-01-13-10-45-02-495.png, image-2022-01-13-10-52-56-490.png,
> image-2022-01-13-10-56-10-668.png, taskmanager.log
>
>
> When a disk error occurs, the related task gets stuck because of a
> java.lang.NoClassDefFoundError. Our internal Flink version is 1.10.0 and we
> have modified some code. The full log and related code are shown below. We
> will analyze them using the code below the pictures.
> !image-2022-01-13-10-45-02-495.png|width=1708,height=913!
> !image-2022-01-13-10-52-56-490.png|width=896,height=689!
> !image-2022-01-13-10-56-10-668.png|width=820,height=366!
> The process is as follows:
> # Disk error occurs.
> # The exception is caught in Task's doRun method.
> # When calling ExceptionUtils.isJvmFatalError(t), another exception
> 'java.lang.NoClassDefFoundError: org/apache/flink/util/ExceptionUtils' is
> thrown.
> # notifyFatalError is called in TaskManagerRunner. I guess that the method
> cannot execute because ExceptionUtils is not found.
> # In Task, notifyFinalState is finally called. Since the state has not
> transitioned to failed, the log 'java.lang.IllegalStateException: null' is
> printed.
> Maybe we should catch exceptions such as NoClassDefFoundError and finally
> call terminateJVM().
--
This message was sent by Atlassian Jira
(v8.20.1#820001)