[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
renxiang zhou updated FLINK-31249:
----------------------------------
Description:
When the JobManager has received ACKs from all tasks, it finalizes the pending
checkpoint into a completed checkpoint. Currently the JM performs this
finalization while holding the checkpoint coordinator lock.
When a DFS failure occurs, the {{jobmanager-future}} thread may block while
finalizing the pending checkpoint.
!image-2023-02-28-12-17-19-607.png|width=1010,height=244!
When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
waits for the lock to be released.
!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
If the previous checkpoint times out in the meantime, the {{Checkpoint Timer}}
thread cannot execute the timeout event because it is blocked waiting for the
lock. As a result, the previous checkpoint cannot be cancelled.
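The interaction described above can be reproduced outside Flink with a minimal sketch. Everything in it is hypothetical and for illustration only: the class {{CheckpointLockSketch}}, the method {{timeoutFires}}, and the latch that stands in for a hung DFS write are not Flink code, and the "I/O outside the lock" variant is only one possible direction, not necessarily the change made for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointLockSketch {

    /**
     * Returns true if the timeout task managed to run while the simulated
     * DFS write was still stuck. holdLockDuringIo = true mimics finalizing
     * a checkpoint while holding the coordinator lock.
     */
    public static boolean timeoutFires(boolean holdLockDuringIo) {
        final Object lock = new Object();                      // stand-in for the coordinator lock
        final CountDownLatch dfsStuck = new CountDownLatch(1);  // opened only when "DFS recovers"
        final CountDownLatch ioStarted = new CountDownLatch(1);
        final CountDownLatch timeoutRan = new CountDownLatch(1);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Plays the role of the jobmanager-future thread.
        Thread finalizer = new Thread(() -> {
            if (holdLockDuringIo) {
                synchronized (lock) {
                    blockingWrite(ioStarted, dfsStuck);  // blocking I/O under the lock
                }
            } else {
                blockingWrite(ioStarted, dfsStuck);      // blocking I/O without the lock
                synchronized (lock) {
                    // only the short in-memory bookkeeping would happen here
                }
            }
        }, "jobmanager-future");
        finalizer.setDaemon(true);
        finalizer.start();

        try {
            ioStarted.await();  // the write is now in progress (and stuck)
            // Plays the role of the Checkpoint Timer: the timeout handler needs the lock.
            timer.schedule(() -> {
                synchronized (lock) {
                    timeoutRan.countDown();
                }
            }, 50, TimeUnit.MILLISECONDS);
            return timeoutRan.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            dfsStuck.countDown();  // let the simulated DFS recover so all threads can exit
            timer.shutdownNow();
        }
    }

    private static void blockingWrite(CountDownLatch started, CountDownLatch stuck) {
        started.countDown();
        try {
            stuck.await();  // simulates a metadata write that hangs
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held during I/O, timeout fired: " + timeoutFires(true));
        System.out.println("I/O outside the lock, timeout fired: " + timeoutFires(false));
    }
}
```

With the lock held across the blocking write, the timer task stays parked on the lock and {{timeoutFires(true)}} returns false; with the write moved outside the lock, {{timeoutFires(false)}} returns true because the timeout handler can acquire the lock immediately.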
was:
The {{jobmanager-future}} thread may block while writing metadata to DFS
because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by this
thread.
When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
for the lock to be released. If the previous checkpoint times out, the
{{Checkpoint Timer}} thread cannot execute the timeout event because it is
blocked waiting for the lock. As a result, the previous checkpoint cannot be cancelled.
!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> -------------------------------------------------------------------
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: renxiang zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the pending
> checkpoint into a completed checkpoint. Currently the JM performs this
> finalization while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out in the meantime, the {{Checkpoint Timer}}
> thread cannot execute the timeout event because it is blocked waiting for the
> lock. As a result, the previous checkpoint cannot be cancelled.