[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
renxiang zhou updated FLINK-31249:
----------------------------------
    Language:   (was: JAVA)

> Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png
>
>
> The jobmanager-future thread may block while writing metadata to DFS because
> of a DFS failure, and this thread holds the CheckpointCoordinator lock while
> it is blocked.
> When the next checkpoint is triggered, the Checkpoint Timer thread waits for
> the lock to be released. If the previous checkpoint times out in the
> meantime, the Checkpoint Timer cannot execute the timeout event because it is
> blocked waiting for the lock. As a result, the previous checkpoint cannot be
> cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
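
The blocking pattern described in this issue can be reproduced in isolation. The following is a minimal sketch, not Flink code: the class `CheckpointLockSketch`, the method `timeoutHandledWhileLockHeld`, the latches, and the use of a single-threaded `ScheduledExecutorService` as the timer are all hypothetical stand-ins chosen for illustration. One thread holds a lock while a simulated DFS write hangs, the timer thread then blocks on the same lock for the next trigger, and the timeout task queued behind it never gets to run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CheckpointLockSketch {

    /**
     * Returns true if the timeout handler ran while the simulated metadata
     * write was still holding the lock. All names here are hypothetical.
     */
    public static boolean timeoutHandledWhileLockHeld(long observeMillis)
            throws InterruptedException {
        final Object coordinatorLock = new Object();           // stands in for the coordinator lock
        final CountDownLatch lockAcquired = new CountDownLatch(1);
        final CountDownLatch dfsRecovered = new CountDownLatch(1);
        final AtomicBoolean timeoutHandled = new AtomicBoolean(false);

        // Stand-in for the jobmanager-future thread: takes the lock, then
        // hangs in a simulated metadata write until the "DFS" recovers.
        Thread futureThread = new Thread(() -> {
            synchronized (coordinatorLock) {
                lockAcquired.countDown();
                try {
                    dfsRecovered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "future-thread-sketch");
        futureThread.start();
        lockAcquired.await(); // the lock is definitely held from here on

        // Stand-in for the checkpoint timer: a single thread runs both the
        // next trigger and the timeout of the previous checkpoint.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.execute(() -> {
            synchronized (coordinatorLock) {
                // The next trigger needs the lock, so the timer thread blocks here.
            }
        });
        timer.schedule(() -> timeoutHandled.set(true), 10, TimeUnit.MILLISECONDS);

        // Observe for a while: the timeout task is stuck behind the blocked trigger.
        Thread.sleep(observeMillis);
        boolean handled = timeoutHandled.get();

        // Let the simulated write finish so every thread can exit cleanly.
        dfsRecovered.countDown();
        futureThread.join();
        timer.shutdown();
        timer.awaitTermination(5, TimeUnit.SECONDS);
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timeout handled while lock held: "
                + timeoutHandledWhileLockHeld(200));
    }
}
```

In this sketch the method returns `false` for as long as the simulated write holds the lock, which mirrors the report: the timeout event is not lost, it simply cannot run until the lock is released. Whether the fix is to perform the blocking write outside the lock or to handle timeouts on a separate thread is a design decision for the actual patch and is not implied here.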