[
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376181#comment-17376181
]
zlzhang0122 commented on FLINK-23189:
-------------------------------------
Sure, [~pnowojski], I have posted an attachment recording the exception thrown
in Flink 1.10. CheckpointCoordinator#triggerCheckpoint() calls
startTriggeringCheckpoint(), which in turn calls initializeCheckpoint(), and
that function may throw an IOException (see
[link|https://github.com/zlzhang0122/flink/blob/9e1cc0ac2bbf0a2e8fcf00e6730a10893d651590/flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java#L83]).
The IOException produces a
CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE just like any other
exception. I think such an IOException is usually caused by a disk error or
some other I/O problem that can hardly be recovered from, so maybe we should
treat it more seriously and surface it to users faster rather than just
logging it.
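To illustrate the proposal, here is a minimal, self-contained sketch (not actual Flink code; the class name CheckpointFailureTracker, the tolerableFailures threshold, and the method names are all hypothetical) of how IOExceptions thrown while triggering a checkpoint could be counted toward a tolerance threshold that eventually fails the job, instead of only being logged:

```java
import java.io.IOException;

// Hypothetical sketch: count checkpoint trigger failures (including
// IOExceptions from JobManager disk errors) and signal job failure
// once a tolerance threshold is exceeded.
public class CheckpointFailureTracker {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    public CheckpointFailureTracker(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a trigger failure; returns true if the job should be failed. */
    public boolean onTriggerFailure(Throwable cause) {
        // Treat IOException like any other trigger failure: increment
        // the counter rather than only logging the exception.
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the consecutive-failure count. */
    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    public static void main(String[] args) {
        CheckpointFailureTracker tracker = new CheckpointFailureTracker(2);
        boolean failJob = false;
        for (int i = 0; i < 3; i++) {
            failJob = tracker.onTriggerFailure(
                    new IOException("disk error on JobManager"));
        }
        // Three consecutive failures exceed a tolerance of 2.
        System.out.println(failJob ? "fail job" : "keep running");
    }
}
```

In real Flink, a similar counting mechanism already exists for execution-time checkpoint failures (the tolerable-checkpoint-failure setting); the proposal here is to make trigger-time IOExceptions participate in that accounting as well.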
> Count and fail the task when the disk is error on JobManager
> ------------------------------------------------------------
>
> Key: FLINK-23189
> URL: https://issues.apache.org/jira/browse/FLINK-23189
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.1
> Reporter: zlzhang0122
> Priority: Major
> Attachments: exception.txt
>
>
> When the JobManager disk has an error, triggerCheckpoint will throw an
> IOException and fail. This produces a TRIGGER_CHECKPOINT_FAILURE, but the
> failure won't cause the job to fail, so users can hardly find the error
> unless they read the JobManager logs. To avoid this, I propose that we
> detect these IOException cases and increase the failureCounter, which can
> eventually fail the job.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)