[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377186#comment-17377186
 ] 

Piotr Nowojski commented on FLINK-23189:
----------------------------------------

Thanks for the more detailed explanation. I think your request makes sense. I 
think these kinds of failures are currently just logged in 
{{CheckpointCoordinator#onTriggerFailure()}}, while they should be passed to 
the {{CheckpointFailureManager}}, which should decide whether the error 
should just be logged or counted against the number of tolerable failures, 
possibly failing the job.
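
To illustrate the decision I have in mind, here is a rough sketch. It is not 
Flink's actual code: the class {{FailureCountingSketch}}, its methods, and the 
enum value {{HYPOTHETICAL_IGNORED_REASON}} are made up for this example, and 
the "fail once the counter exceeds the tolerable number, reset on success" 
rule is an assumption about the desired behaviour.

```java
import java.util.Set;

/** Hypothetical sketch; this is NOT Flink's CheckpointFailureManager. */
class FailureCountingSketch {

    /** Illustrative reasons; only TRIGGER_CHECKPOINT_FAILURE is a name from this ticket. */
    enum Reason {
        TRIGGER_CHECKPOINT_FAILURE,
        HYPOTHETICAL_IGNORED_REASON
    }

    private final int tolerableFailures;
    private final Set<Reason> countedReasons;
    private int continuousFailures;

    FailureCountingSketch(int tolerableFailures, Set<Reason> countedReasons) {
        this.tolerableFailures = tolerableFailures;
        this.countedReasons = countedReasons;
    }

    /**
     * Records a failure and reports whether the job should be failed.
     * Reasons outside countedReasons are treated as "log only".
     */
    boolean onFailure(Reason reason) {
        if (!countedReasons.contains(reason)) {
            return false; // ignored/logged only, never counted
        }
        continuousFailures++;
        // Assumed rule: fail once the counter exceeds the tolerable number.
        return continuousFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the counter (assumed behaviour). */
    void onSuccess() {
        continuousFailures = 0;
    }
}
```

With one tolerable failure and only {{TRIGGER_CHECKPOINT_FAILURE}} counted, the 
second consecutive trigger failure would ask for a job failure, while ignored 
reasons never would.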

So as part of this ticket, I would expect someone to go through the current 
exceptions (including all occurrences of {{TRIGGER_CHECKPOINT_FAILURE}}) and 
decide which should be ignored/logged and which can cause job failover, 
potentially splitting {{TRIGGER_CHECKPOINT_FAILURE}} into new failure reasons, 
and to implement that accordingly in the {{CheckpointFailureManager}}.
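
As an example of what such a split could look like, the sketch below maps the 
root cause of a trigger failure to a finer-grained reason. Everything in it is 
hypothetical: {{TriggerFailureClassifierSketch}}, {{SplitReason}} and 
{{TRIGGER_IO_FAILURE_SKETCH}} do not exist in Flink, and treating every 
{{IOException}} in the cause chain as an I/O failure is only one possible rule.

```java
import java.io.IOException;

/** Hypothetical sketch of splitting one coarse failure reason by root cause. */
class TriggerFailureClassifierSketch {

    enum SplitReason {
        TRIGGER_CHECKPOINT_FAILURE, // coarse reason named in this ticket
        TRIGGER_IO_FAILURE_SKETCH   // made-up reason, e.g. for a broken JobManager disk
    }

    /** Upper bound on the cause-chain walk, to stay safe with odd exception chains. */
    private static final int MAX_DEPTH = 32;

    /** Returns the I/O reason if any exception in the cause chain is an IOException. */
    static SplitReason classify(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof IOException) {
                return SplitReason.TRIGGER_IO_FAILURE_SKETCH;
            }
            current = current.getCause();
        }
        return SplitReason.TRIGGER_CHECKPOINT_FAILURE;
    }
}
```

The failure manager could then count the I/O reason against the tolerable 
failures while still only logging the remaining trigger failures.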

Additionally, it would be good to check whether the other failure reasons are 
treated sensibly in the {{CheckpointFailureManager}}.

I'm also afraid that this change would cause quite a few test instabilities, 
so it might turn out to be somewhat more difficult than it looks at first glance.

[~zlzhang0122], would you be willing to work on this issue? Or did you just 
want to propose an idea for us to pick up at some point in the future?

> Count and fail the task when the disk is error on JobManager
> ------------------------------------------------------------
>
>                 Key: FLINK-23189
>                 URL: https://issues.apache.org/jira/browse/FLINK-23189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.1
>            Reporter: zlzhang0122
>            Priority: Major
>         Attachments: exception.txt
>
>
> When the JobManager disk fails, triggerCheckpoint throws an 
> IOException and fails. This causes a TRIGGER_CHECKPOINT_FAILURE, but this 
> failure does not cause the job to fail. Users can hardly notice this error 
> unless they check the JobManager logs. To avoid this, I propose that we 
> identify these IOException cases and increase the failureCounter, so that 
> the job eventually fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
