[
https://issues.apache.org/jira/browse/FLINK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489330#comment-17489330
]
Liu commented on FLINK-25566:
-----------------------------
[~pnowojski] Thanks for the reply. I agree with you that the hacky approach is
not suitable. Initially, I just wanted to verify the problem. Now we can
discuss whether a good solution exists. For example, we could fail the tasks
and kill the taskmanager from the jobmaster side if a task is stuck for too
long. Or we could set up a thread called HealthChecker to check the
taskmanager's health, for example by reading from and writing to a disk. A
disk error is an extreme case, but nobody can avoid it in production. Maybe we
can do something to reduce its impact.
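
To make the HealthChecker idea concrete, here is a minimal sketch in plain
Java. It is not Flink code: the class name DiskHealthChecker, the probe file
naming, and the onUnhealthy callback are all assumptions made for
illustration. The thread writes a small file into a directory, reads it back,
and invokes the callback when that fails.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Flink: periodically probes a directory
// and reports a failed probe through a caller-supplied callback.
final class DiskHealthChecker implements AutoCloseable {
    private static final byte[] PROBE =
            "disk-health-probe".getBytes(StandardCharsets.UTF_8);

    // Single daemon thread named "HealthChecker", as suggested in the comment.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "HealthChecker");
                t.setDaemon(true);
                return t;
            });

    // Returns true if a probe file could be written and read back unchanged.
    static boolean probe(Path directory) {
        Path file = null;
        try {
            file = Files.createTempFile(directory, "health-", ".probe");
            Files.write(file, PROBE);
            return Arrays.equals(PROBE, Files.readAllBytes(file));
        } catch (IOException | RuntimeException e) {
            return false;
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // Cleanup is best effort; the probe result is already decided.
                }
            }
        }
    }

    // Runs the probe immediately and then again after every periodMillis.
    void start(Path directory, long periodMillis, Runnable onUnhealthy) {
        executor.scheduleWithFixedDelay(() -> {
            if (!probe(directory)) {
                onUnhealthy.run();
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

A caller would create one checker per monitored directory and pass a callback
that fails the tasks or shuts the process down. Note that this sketch only
detects I/O calls that fail; a disk that makes the I/O call hang would block
the probe thread, so a real implementation would probably also need a timeout
around the probe.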
> Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError
> ---------------------------------------------------------------------
>
> Key: FLINK-25566
> URL: https://issues.apache.org/jira/browse/FLINK-25566
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Reporter: Liu
> Priority: Major
> Attachments: image-2022-01-07-19-07-10-968.png,
> image-2022-01-07-19-08-49-038.png, image-2022-01-07-19-11-39-448.png,
> image-2022-01-13-10-45-02-495.png, image-2022-01-13-10-52-56-490.png,
> image-2022-01-13-10-56-10-668.png, taskmanager.log
>
>
> When a disk error occurs, the related task gets stuck because of
> java.lang.NoClassDefFoundError. Our internal Flink version is 1.10.0 and we
> have modified some code. The full log and the related code are shown below. We
> analyze them below the pictures.
> !image-2022-01-13-10-45-02-495.png|width=1708,height=913!
> !image-2022-01-13-10-52-56-490.png|width=896,height=689!
> !image-2022-01-13-10-56-10-668.png|width=820,height=366!
> The process is as follows:
> # A disk error occurs.
> # The exception is caught in Task's doRun method.
> # When calling ExceptionUtils.isJvmFatalError(t), another exception
> 'java.lang.NoClassDefFoundError: org/apache/flink/util/ExceptionUtils' is
> thrown.
> # notifyFatalError is called in TaskManagerRunner. I guess that the method
> cannot execute because ExceptionUtils cannot be found.
> # In Task, notifyFinalState is finally called. Since the state has not
> transitioned to failed, the log 'java.lang.IllegalStateException: null' is
> printed.
> Maybe we should catch errors such as NoClassDefFoundError and finally call
> terminateJVM().
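
The proposal at the end of the quoted description (catch NoClassDefFoundError,
then terminate the JVM) could look roughly like the sketch below. It is plain
Java and not Flink's actual code: FatalErrorGuard and runGuarded are
hypothetical names, and the terminateJvm argument stands in for whatever the
process uses to exit.

```java
// Hypothetical sketch, not Flink code: runs the regular failure handling and
// falls back to terminating the JVM if a class needed by it cannot be loaded.
final class FatalErrorGuard {

    private FatalErrorGuard() {
    }

    static void runGuarded(Runnable failureHandling, Runnable terminateJvm) {
        try {
            failureHandling.run();
        } catch (NoClassDefFoundError e) {
            // The failure handling itself is broken (for example because a
            // class file could not be read), so the only remaining option in
            // this sketch is to terminate the process.
            terminateJvm.run();
        }
    }
}
```

In a real process the second argument could be something like
() -> Runtime.getRuntime().halt(1), where the exit code 1 is an assumption.
The termination path should avoid touching classes that have not been loaded
yet, because loading them may fail for the same reason.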