[ 
https://issues.apache.org/jira/browse/FLINK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489353#comment-17489353
 ] 

Piotr Nowojski commented on FLINK-25566:
----------------------------------------

I see your point and I tend to agree that Flink could/should try to recover 
from this kind of situation, or at the very least fail over. However, I'm not 
sure what we could do. Detecting this issue from the outside may be impossible, 
as error handling can fail while heartbeats are still working. I'm pretty 
sure that placing 
{code:java}
try {
    // error handling code
} catch (NoClassDefFoundError e) {
    System.exit(42);
}
{code}
somewhere inside this particular error handling code won't solve all of the 
cases, as we can hit the same problem just one method call earlier, above 
this try/catch.

As this is not a Flink-specific issue, I wonder how other projects approach 
this kind of problem. Maybe we can force the JVM to load all Flink core 
classes during startup?
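
A minimal sketch of the preloading idea, under stated assumptions: the class 
name ClassPreloader and the method preload are hypothetical and not part of 
Flink, and the class names passed in main are placeholders. It only shows 
that Class.forName can load and initialize a list of classes eagerly. The JVM 
resolves referenced classes lazily, so every class that error handling relies 
on would have to be listed explicitly.

```java
import java.util.ArrayList;
import java.util.List;

public class ClassPreloader {

    /**
     * Tries to load and initialize each named class up front.
     * Returns the names of the classes that could not be loaded.
     */
    public static List<String> preload(ClassLoader loader, String... classNames) {
        List<String> failed = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize = true also runs static initializers now,
                // instead of on first use of the class.
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError and
                // ExceptionInInitializerError.
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> failed = preload(
                ClassPreloader.class.getClassLoader(),
                "java.util.concurrent.ConcurrentHashMap",
                "org.example.DoesNotExist"); // placeholder name, expected to fail
        System.out.println("Failed to preload: " + failed);
    }
}
```

For the case in this ticket, the list would have to include at least 
org.apache.flink.util.ExceptionUtils. Whether preloading is enough to survive 
a bad disk is not verified here.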

> Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError
> ---------------------------------------------------------------------
>
>                 Key: FLINK-25566
>                 URL: https://issues.apache.org/jira/browse/FLINK-25566
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>            Reporter: Liu
>            Priority: Major
>         Attachments: image-2022-01-07-19-07-10-968.png, 
> image-2022-01-07-19-08-49-038.png, image-2022-01-07-19-11-39-448.png, 
> image-2022-01-13-10-45-02-495.png, image-2022-01-13-10-52-56-490.png, 
> image-2022-01-13-10-56-10-668.png, taskmanager.log
>
>
> When a disk error occurs, the related task gets stuck because of a 
> java.lang.NoClassDefFoundError. Our internal Flink version is 1.10.0 and we 
> have modified some code. The full log and the related code are shown below. 
> We analyze them below the pictures.
> !image-2022-01-13-10-45-02-495.png|width=1708,height=913!
> !image-2022-01-13-10-52-56-490.png|width=896,height=689!
> !image-2022-01-13-10-56-10-668.png|width=820,height=366!
> The process is as follows:
>  # A disk error occurs.
>  # The exception is caught in Task's doRun method.
>  # When ExceptionUtils.isJvmFatalError(t) is called, another exception, 
> 'java.lang.NoClassDefFoundError: org/apache/flink/util/ExceptionUtils', is 
> thrown.
>  # notifyFatalError is called in TaskManagerRunner. I guess that the method 
> cannot execute because ExceptionUtils is not found.
>  # In Task, notifyFinalState is called at the end. Since the state has not 
> transitioned to failed, the log 'java.lang.IllegalStateException: null' is 
> printed.
> Maybe we should catch exceptions such as NoClassDefFoundError and finally 
> call terminateJVM().



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
