[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374788#comment-17374788 ]
Piotr Nowojski commented on FLINK-23189:
----------------------------------------
Thank you for reporting the problem [~zlzhang0122]. Could you share an
example stack trace or log entry that you are referring to, and say which types
of exceptions you propose to count against the max tolerable checkpoint
failures counter? At first glance I cannot see where in
{{CheckpointCoordinator#triggerCheckpoint()}} an {{IOException}} can be thrown.
> Count and fail the task when the disk is error on JobManager
> ------------------------------------------------------------
>
> Key: FLINK-23189
> URL: https://issues.apache.org/jira/browse/FLINK-23189
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.1
> Reporter: zlzhang0122
> Priority: Major
>
> When the JobManager disk is faulty, triggerCheckpoint will throw an
> IOException and fail. This causes a TRIGGER_CHECKPOINT_FAILURE, but this
> failure does not cause the job to fail. Users can hardly notice this error
> unless they check the JobManager logs. To avoid this case, I propose that we
> identify these IOException cases and increase the failureCounter, which can
> eventually fail the job.
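
The proposal quoted above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Flink's actual implementation: the class TriggerFailureCounter and its methods are hypothetical names invented here, and the sketch assumes that the disk error surfaces as an IOException somewhere in the cause chain of the trigger failure, which is the open question in the comment above.

```java
import java.io.IOException;

/**
 * Hypothetical sketch (not a Flink class): counts checkpoint trigger failures
 * caused by an IOException and reports when the tolerable limit is exceeded.
 */
final class TriggerFailureCounter {
    private final int tolerableFailures;
    private int continuousFailures;

    TriggerFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a trigger failure. Only failures with an IOException in their
     * cause chain are counted. Returns true if the job should now be failed.
     */
    boolean onTriggerFailure(Throwable cause) {
        if (hasIOExceptionCause(cause)) {
            continuousFailures++;
        }
        return continuousFailures > tolerableFailures;
    }

    /** A completed checkpoint resets the count of continuous failures. */
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }

    int getContinuousFailures() {
        return continuousFailures;
    }

    /** Walks the cause chain looking for an IOException. */
    static boolean hasIOExceptionCause(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }
}
```

With a limit of 1, the first IOException-caused trigger failure is tolerated and the second one signals that the job should fail; failures without an IOException in the cause chain are ignored by this sketch.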