[ https://issues.apache.org/jira/browse/FLINK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488133#comment-17488133 ]
Piotr Nowojski commented on FLINK-25566:
----------------------------------------
Hey [~Jiangang]. I'm not sure what you would like us to change. Is the primary
thing to address in this ticket that the disk error occurred? That the
`NoClassDefFoundError` was thrown? Or that the TaskManager wasn't able to
fail over?

As I understand it (please correct me if I'm wrong), the disk errors are
external and unrelated to Flink. As an indirect result of those errors,
some classes could not be loaded (not a Flink bug), and the
`NoClassDefFoundError` prevented the TM from failing over. Is that what's
happening? And is the proposal here that we should add a `try/catch` for
`NoClassDefFoundError` and kill the TM regardless? If so, I would be against
complicating the code and having Flink try to recover from such weird errors.
For example, if a `NoClassDefFoundError` has been thrown for a missing
`ExceptionUtils`, what makes us think that the TM will be able to execute any
other code? As far as I can tell, even
`org.apache.flink.runtime.taskexecutor.TaskManagerRunner#onFatalError` could
fail, as it uses `TaskManagerExceptionUtils` and `FlinkSecurityManager`. They
might be missing as well...
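
To make the concern concrete, here is a minimal sketch of the kind of guard being discussed. It is not Flink code: `FatalErrorGuard`, `classifyOrHalt`, and the injected `classifier` and `halt` parameters are hypothetical names used only for illustration. The classifier stands in for a call such as `ExceptionUtils.isJvmFatalError(t)`, and the halt action stands in for whatever would terminate the JVM.

```java
import java.io.IOException;
import java.util.function.Predicate;

/** Hypothetical illustration only; none of these names exist in Flink. */
public class FatalErrorGuard {

    /**
     * Asks the classifier whether the failure is fatal. If the classifier itself
     * cannot run because a class is missing, treat that as fatal and run the
     * halt action instead of letting the error propagate.
     */
    public static boolean classifyOrHalt(
            Throwable failure, Predicate<Throwable> classifier, Runnable halt) {
        try {
            return classifier.test(failure);
        } catch (LinkageError e) {
            // NoClassDefFoundError is a subclass of LinkageError.
            // Only JDK types are touched on this path (assumption of the sketch).
            halt.run();
            return true;
        }
    }

    public static void main(String[] args) {
        // Simulates a classifier whose own class could not be loaded.
        Predicate<Throwable> brokenClassifier = t -> {
            throw new NoClassDefFoundError("org/apache/flink/util/ExceptionUtils");
        };
        boolean[] halted = {false};
        // A real halt action might be () -> Runtime.getRuntime().halt(-1).
        boolean fatal = classifyOrHalt(
                new IOException("simulated disk error"),
                brokenClassifier,
                () -> halted[0] = true);
        System.out.println("fatal=" + fatal + ", halted=" + halted[0]);
    }
}
```

Even in this toy form, the guard only helps if the guard class itself and everything on its catch path were already loaded before the disk went bad, which is exactly the part that cannot be relied upon.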
> Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError
> ---------------------------------------------------------------------
>
> Key: FLINK-25566
> URL: https://issues.apache.org/jira/browse/FLINK-25566
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Reporter: Liu
> Priority: Major
> Attachments: image-2022-01-07-19-07-10-968.png,
> image-2022-01-07-19-08-49-038.png, image-2022-01-07-19-11-39-448.png,
> image-2022-01-13-10-45-02-495.png, image-2022-01-13-10-52-56-490.png,
> image-2022-01-13-10-56-10-668.png, taskmanager.log
>
>
> When a disk error occurs, the related task gets stuck because of a
> java.lang.NoClassDefFoundError. Our internal Flink version is 1.10.0 and we
> have modified some code. The full log and the related code are shown below.
> We analyze them with the code below the pictures.
> !image-2022-01-13-10-45-02-495.png|width=1708,height=913!
> !image-2022-01-13-10-52-56-490.png|width=896,height=689!
> !image-2022-01-13-10-56-10-668.png|width=820,height=366!
> The process is as follows:
> # A disk error occurs.
> # The exception is caught in Task's doRun method.
> # When ExceptionUtils.isJvmFatalError(t) is called, another exception,
> 'java.lang.NoClassDefFoundError: org/apache/flink/util/ExceptionUtils', is
> thrown.
> # notifyFatalError is called in TaskManagerRunner. I guess that the method
> cannot execute because ExceptionUtils is not found.
> # In Task, notifyFinalState is called at the end. Since the state has not
> transitioned to failed, the log 'java.lang.IllegalStateException: null' is
> printed.
> Maybe we should catch exceptions such as NoClassDefFoundError and finally
> call terminateJVM().