[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

Renxiang Zhou (Jira) Fri, 03 Mar 2023 03:26:14 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696116#comment-17696116
 ]


Renxiang Zhou commented on FLINK-31249:
---------------------------------------

[~roman] If finalizing the checkpoint metadata takes too long, it usually 
means there is a problem with the external storage service (in HDFS, for 
example, this can happen when writing to a slow DataNode). In that case, we 
could retry writing the metadata to the DFS, or simply discard this checkpoint 
and trigger another one, rather than leaving the checkpoint stuck. What do you 
think?
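
To make the retry-or-discard idea concrete, here is a minimal sketch in plain 
Java. It is not Flink code: the class FinalizeWithTimeout, the method 
finalizeOrDiscard, and the timeoutMillis and maxAttempts parameters are all 
hypothetical names used only for illustration, and the metadata write is 
represented by an arbitrary Callable. The point is only that the blocking write 
runs on a separate I/O thread with a bounded wait, so the caller can give up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Flink.
class FinalizeWithTimeout {
    enum Outcome { COMPLETED, DISCARDED }

    // Runs the (possibly blocking) metadata write on a separate daemon thread
    // and waits at most timeoutMillis per attempt. After maxAttempts failed or
    // timed-out attempts the checkpoint is reported as discarded.
    static Outcome finalizeOrDiscard(
            Callable<String> writeMetadata, long timeoutMillis, int maxAttempts) {
        ExecutorService io = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "metadata-io");
            t.setDaemon(true); // a stuck write must not keep the JVM alive
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<String> write = io.submit(writeMetadata);
                try {
                    write.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return Outcome.COMPLETED;
                } catch (TimeoutException | ExecutionException e) {
                    // Slow or failed storage: interrupt this attempt and retry.
                    write.cancel(true);
                } catch (InterruptedException e) {
                    write.cancel(true);
                    Thread.currentThread().interrupt();
                    return Outcome.DISCARDED;
                }
            }
            return Outcome.DISCARDED;
        } finally {
            io.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated slow DataNode: the write takes far longer than the timeout.
        Callable<String> slowWrite = () -> { Thread.sleep(10_000); return "_metadata"; };
        // Simulated healthy storage: the write returns immediately.
        Callable<String> fastWrite = () -> "_metadata";
        System.out.println(finalizeOrDiscard(slowWrite, 100, 2));
        System.out.println(finalizeOrDiscard(fastWrite, 100, 2));
    }
}
```

In a real change the wait would have to happen without holding the coordinator 
lock, and a write that ignores interruption would still occupy its I/O thread; 
the sketch does not address either point.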

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> -------------------------------------------------------------------
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the pending 
> checkpoint into a completed checkpoint. Currently the JM finalizes the pending 
> checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} thread cannot 
> execute the timeout event because it is blocked waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.
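
The blocking described in the quoted issue can be reproduced in isolation. The 
following is a rough sketch in plain Java, not Flink code: the class 
CoordinatorLockDemo and its method are hypothetical, a plain monitor object 
stands in for the checkpoint coordinator lock, and a latch stands in for the 
stalled DFS write. Only the two thread names are taken from the issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; this is not Flink's CheckpointCoordinator.
class CoordinatorLockDemo {

    // Returns {handlerRanDuringStall, handlerRanAfterRelease}.
    // waitMillis must be > 0, because Thread.join(0) would wait forever.
    static boolean[] run(long waitMillis) {
        final Object lock = new Object();                 // stand-in for the coordinator lock
        CountDownLatch finalizerHoldsLock = new CountDownLatch(1);
        CountDownLatch storageRecovered = new CountDownLatch(1);
        AtomicBoolean timeoutHandled = new AtomicBoolean(false);

        // Finalizes the pending checkpoint while holding the lock and stalls
        // on the simulated storage write.
        Thread finalizer = new Thread(() -> {
            synchronized (lock) {
                finalizerHoldsLock.countDown();
                try {
                    storageRecovered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");

        // Needs the same lock before it can run the timeout handling.
        Thread timer = new Thread(() -> {
            synchronized (lock) {
                timeoutHandled.set(true);
            }
        }, "Checkpoint Timer");

        try {
            finalizer.start();
            finalizerHoldsLock.await();       // the lock is now held by the finalizer
            timer.start();
            timer.join(waitMillis);           // give the timer a chance to run
            boolean duringStall = timeoutHandled.get();
            storageRecovered.countDown();     // the stalled write finally returns
            timer.join();
            finalizer.join();
            return new boolean[] {duringStall, timeoutHandled.get()};
        } catch (InterruptedException e) {
            storageRecovered.countDown();     // do not leave the finalizer stuck
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the demo", e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run(200);
        System.out.println("timeout handled during stall: " + result[0]);
        System.out.println("timeout handled after release: " + result[1]);
    }
}
```

The timeout handling runs only once the finalizing thread leaves the 
synchronized block, which matches the behaviour shown in the attached thread 
dumps.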



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
