[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495420#comment-17495420
 ] 

Piotr Nowojski commented on FLINK-26049:
----------------------------------------

Hi [~fanrui], I would suggest slowing down a bit here and thinking more 
about how we want to treat failures on the {{CheckpointCoordinator}}. Is 
this really a bug? So far we have only committed ourselves to counting 
IOExceptions on the CheckpointCoordinator against the tolerable failed 
checkpoints counter. We have never claimed that any other type of exception 
would be treated the same way.
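
The contract described above can be sketched roughly as follows. This is a 
minimal, hypothetical model in Java, not Flink's actual implementation: the 
class {{TolerableFailureSketch}} and its method names are invented for 
illustration, and resetting the counter on a successful checkpoint is an 
assumption of the sketch.

```java
import java.io.IOException;

/** Hypothetical model of the counting rule; not Flink's real failure manager. */
final class TolerableFailureSketch {
    private final int tolerableFailures;
    private int continuousFailures = 0;

    TolerableFailureSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a failure seen on the coordinator side.
     * Returns true if the tolerable limit is exceeded and the job should fail.
     */
    boolean onCoordinatorFailure(Throwable cause) {
        if (!(cause instanceof IOException)) {
            // Only IOExceptions are counted in this sketch; any other
            // exception type leaves the counter untouched.
            return false;
        }
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    /** Assumption: a successful checkpoint resets the counter. */
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }

    int continuousFailures() {
        return continuousFailures;
    }
}
```

Under this model, with a tolerance of 0, a trigger failure that is not an 
IOException leaves the counter at 0, while a single IOException exceeds the 
limit.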

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-26049
>                 URL: https://issues.apache.org/jira/browse/FLINK-26049
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.5, 1.14.3
>            Reporter: fanrui
>            Assignee: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>
> If a checkpoint fails after triggerCheckpoint has succeeded, Flink executes 
> the tolerable-failed-checkpoints logic. But if triggerCheckpoint itself 
> fails, Flink does not execute the tolerable-failed-checkpoints logic.
> h1. How to reproduce this issue?
> In our production environment, an HDFS SRE deleted the Flink base directory 
> by mistake, and the Flink job did not have permission to create the 
> checkpoint directory, so triggering checkpoints failed.
> Several things did not meet our expectations:
>  * The JM just logs _"Failed to trigger checkpoint for job 
> 6f09d4a15dad42b24d52c987f5471f18 since Trigger checkpoint failure"._ It does 
> not show the root cause or the exception.
>  * The user set tolerable-failed-checkpoints=0, but when triggerCheckpoint 
> fails, Flink does not execute the tolerable-failed-checkpoints logic.
>  * When triggerCheckpoint fails, numberOfFailedCheckpoints is always 0.
>  * When triggerCheckpoint fails, no checkpoint info appears on the checkpoint 
> history page.
>  
> !image-2022-02-09-18-08-17-868.png!
>  
> !image-2022-02-09-18-08-34-992.png!
> !image-2022-02-09-18-08-42-920.png!
>  
> h3. *All metrics looked normal, so we only found out the next day that 
> checkpoints had been failing for a whole day. This is not acceptable to 
> Flink users.*
> I have some ideas:
>  # Should the tolerable-failed-checkpoints logic be executed when 
> triggerCheckpoint fails?
>  # When triggerCheckpoint fails, should numberOfFailedCheckpoints be 
> increased?
>  # When triggerCheckpoint fails, should checkpoint info be shown on the 
> checkpoint history page?
>  # The JM only shows "Failed to trigger checkpoint"; should we show the 
> detailed exception to make the root cause easy to find?
>  
> Masters, could we do these changes? Please correct me if I'm wrong.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
