[
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376181#comment-17376181
]
zlzhang0122 commented on FLINK-23189:
-------------------------------------
Sure, [~pnowojski], I have posted an attachment recording the exception thrown
in Flink 1.10. CheckpointCoordinator#triggerCheckpoint() calls
startTriggeringCheckpoint(), which in turn calls initializeCheckpoint(), and
that function may throw an IOException (see
[link|https://github.com/zlzhang0122/flink/blob/9e1cc0ac2bbf0a2e8fcf00e6730a10893d651590/flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java#L83]).
The IOException produces a
CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE just like any other
exception. I think such an IOException is usually caused by a disk error or
some other I/O problem that can hardly be recovered from, so maybe we should
treat it more seriously and surface it to users faster rather than just
logging it.
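To illustrate the proposal, here is a minimal, self-contained sketch (not actual Flink code; the class name CheckpointFailureTracker, the tolerableFailures threshold, and the method names are all hypothetical) of how IOExceptions thrown while triggering a checkpoint could be counted toward a tolerance threshold that eventually fails the job, instead of only being logged:

```java
import java.io.IOException;

// Hypothetical sketch: count checkpoint trigger failures (including
// IOExceptions from JobManager disk errors) and signal job failure
// once a tolerance threshold is exceeded.
public class CheckpointFailureTracker {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    public CheckpointFailureTracker(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a trigger failure; returns true if the job should be failed. */
    public boolean onTriggerFailure(Throwable cause) {
        // Treat IOException like any other trigger failure: increment
        // the counter rather than only logging the exception.
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the consecutive-failure count. */
    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    public static void main(String[] args) {
        CheckpointFailureTracker tracker = new CheckpointFailureTracker(2);
        boolean failJob = false;
        for (int i = 0; i < 3; i++) {
            failJob = tracker.onTriggerFailure(
                    new IOException("disk error on JobManager"));
        }
        // Three consecutive failures exceed a tolerance of 2.
        System.out.println(failJob ? "fail job" : "keep running");
    }
}
```

In real Flink, a similar counting mechanism already exists for execution-time checkpoint failures (the tolerable-checkpoint-failure setting); the proposal here is to make trigger-time IOExceptions participate in that accounting as well.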
> Count and fail the task when the disk is error on JobManager
> ------------------------------------------------------------
>
> Key: FLINK-23189
> URL: https://issues.apache.org/jira/browse/FLINK-23189
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.1
> Reporter: zlzhang0122
> Priority: Major
> Attachments: exception.txt
>
>
> When the JobManager disk has an error, triggerCheckpoint will throw an
> IOException and fail. This produces a TRIGGER_CHECKPOINT_FAILURE, but the
> failure won't cause the job to fail, so users can hardly find the error
> unless they read the JobManager logs. To avoid this, I propose that we
> detect these IOException cases and increase the failureCounter, which can
> eventually fail the job.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)