[
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695918#comment-17695918
]
Roman Khachatryan commented on FLINK-31249:
-------------------------------------------
[~mayuehappy], [~zhourenxiang],
in the images I see that CheckpointCoordinator.chooseRequestToExecute is
waiting for the last checkpoint to be finalized. This is intentional, to avoid
concurrency issues.
IIUC, checkpoint finalization is paused artificially.
Are you observing any issues with that in a non-mocked setup?
> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> -------------------------------------------------------------------
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: renxiang zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the pending
> checkpoint into a completed checkpoint. Currently, the JM finalizes the pending
> checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} thread cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.
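
The blocking pattern described above can be sketched in isolation. The
following is a minimal illustration only, not Flink code: the class name,
the method name, the plain Object used as the lock, and the latch that
stands in for a stalled DFS write are all assumptions made for this example.
One thread holds the lock while "finalizing", and a second thread that needs
the same lock to handle the timeout cannot make progress until the first one
returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CoordinatorLockSketch {

    /**
     * Returns true if the timeout handler ran within waitMillis while the
     * finalizing thread was still holding the lock. All names are hypothetical.
     */
    static boolean timeoutRunsDuringFinalize(long waitMillis) {
        final Object lock = new Object(); // stand-in for the coordinator lock
        final CountDownLatch finalizeStarted = new CountDownLatch(1);
        final CountDownLatch storageRecovered = new CountDownLatch(1);
        final CountDownLatch timeoutHandled = new CountDownLatch(1);

        // Plays the role of the thread that finalizes the pending checkpoint.
        Thread finalizer = new Thread(() -> {
            synchronized (lock) {
                finalizeStarted.countDown();
                try {
                    storageRecovered.await(); // simulates a write stuck on a failing DFS
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");

        // Plays the role of the timer thread; it needs the same lock to act.
        Thread timer = new Thread(() -> {
            synchronized (lock) {
                timeoutHandled.countDown(); // the timeout event would be handled here
            }
        }, "Checkpoint Timer");

        try {
            finalizer.start();
            finalizeStarted.await(); // make sure the lock is already held
            timer.start();

            // While the finalizer is stuck, the timer cannot enter the lock.
            boolean ranWhileBlocked =
                    timeoutHandled.await(waitMillis, TimeUnit.MILLISECONDS);

            storageRecovered.countDown(); // let the simulated write finish
            finalizer.join();
            timer.join();
            return ranWhileBlocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout handled while finalizing: "
                + timeoutRunsDuringFinalize(200));
        // prints "timeout handled while finalizing: false"
    }
}
```

In this sketch the timeout handler only runs once the simulated write
returns, which mirrors the reported behaviour: the timeout cannot fire for
as long as finalization holds the lock.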