[ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renxiang zhou updated FLINK-31249:
----------------------------------
    Description: 
The {{jobmanager-future}} thread may block while writing metadata to DFS 
because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by this 
thread while it is blocked. 

When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits 
for the lock to be released. If the previous checkpoint times out in the 
meantime, the {{Checkpoint Timer}} cannot execute the timeout event because it 
is still blocked waiting for the lock. As a result, the previous checkpoint 
cannot be cancelled.
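
The blocking pattern described above can be reproduced outside Flink with a 
minimal sketch. It does not use any Flink classes: {{CheckpointLockSketch}}, 
{{LOCK}}, the thread name and the latches are hypothetical stand-ins. One 
thread holds a shared lock while a simulated DFS write is stalled, and a 
single-threaded timer tries to run a timeout handler that needs the same lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointLockSketch {
    // Hypothetical stand-in for the coordinator-wide lock; not a Flink class.
    private static final Object LOCK = new Object();

    /**
     * Returns true if the timeout handler ran while the simulated metadata
     * write was still stalled and holding the lock.
     */
    public static boolean timeoutFiresDuringStalledWrite(long waitMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch dfsRecovered = new CountDownLatch(1);
        CountDownLatch timeoutHandled = new CountDownLatch(1);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            // Plays the role of the thread writing metadata under the lock.
            Thread writer = new Thread(() -> {
                synchronized (LOCK) {
                    lockHeld.countDown();
                    try {
                        dfsRecovered.await(); // simulated stalled DFS write
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "jobmanager-future-sketch");
            writer.start();
            lockHeld.await();

            // Plays the role of the timer: the timeout handler needs the same lock.
            timer.schedule(() -> {
                synchronized (LOCK) {
                    timeoutHandled.countDown();
                }
            }, 10, TimeUnit.MILLISECONDS);

            // The handler is due after 10 ms, but the lock is not free.
            boolean firedWhileStalled =
                    timeoutHandled.await(waitMillis, TimeUnit.MILLISECONDS);

            // Let the "write" finish; only now can the handler acquire the lock.
            dfsRecovered.countDown();
            writer.join();
            if (!timeoutHandled.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("handler did not run after lock release");
            }
            return firedWhileStalled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // prints "timeout fired during stalled write: false"
        System.out.println("timeout fired during stalled write: "
                + timeoutFiresDuringStalledWrite(300));
    }
}
```

In this sketch the handler only runs after the writer leaves the synchronized 
block, which mirrors the report: as long as the write is stalled, the timeout 
event is delayed no matter how long ago it became due.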

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!

  was:
The jobmanager-future thread may be blocked at writing metadata to DFS caused 
by a DFS failure, and the CheckpointCoordinator Lock is hold by this thread. 

When the next Checkpoint is triggered, the Checkpoint Timer thread waits for 
the lock to be released.  If the previous checkpoint times out, the checkpoint 
timer does not execute the timeout event since it is blocked at waiting for the 
lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!


> Checkpoint Timer failed to process timeout events when it blocked at writing 
> _metadata to DFS
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png
>
>
> The {{jobmanager-future}} thread may block while writing metadata to DFS 
> because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by 
> this thread while it is blocked. 
> When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits 
> for the lock to be released. If the previous checkpoint times out in the 
> meantime, the {{Checkpoint Timer}} cannot execute the timeout event because 
> it is still blocked waiting for the lock. As a result, the previous 
> checkpoint cannot be cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!



