[
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696230#comment-17696230
]
Roman Khachatryan commented on FLINK-31249:
-------------------------------------------
Triggering a new checkpoint without unblocking the other (main) thread doesn't
make much sense to me: the main thread is required at least to process the ACKs
for that new checkpoint.
Ideally, all IO should be done in a separate thread, but we're not there yet. I
don't see a way to interrupt writing metadata generically (for any filesystem).
Rather, specific FS implementations can be configured to time out requests that
take too long.
Besides that, the same filesystem usually stores state backend snapshots and
this metadata. When overloaded, it's more likely that state backend snapshots
will time out first.
> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> -------------------------------------------------------------------
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: Renxiang Zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager receives ACKs from all tasks, it finalizes the pending
> checkpoint into a completed checkpoint. Currently the JM finalizes the pending
> checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may be blocked
> while finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not
> execute the timeout event since it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.
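
The blocking described in the issue can be reproduced in isolation. The
following is a minimal sketch, not Flink code: the class, the lock object and
the latches are hypothetical stand-ins, and the stuck DFS write is simulated by
a latch that is only released after the observation window. Only the thread
names are taken from the issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class CheckpointLockDemo {
    // Hypothetical stand-in for the checkpoint coordinator lock.
    private static final Object COORDINATOR_LOCK = new Object();

    /** Returns true if the timeout handler managed to run while finalization was stuck. */
    static boolean timeoutRanWhileFinalizing() {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch dfsRecovered = new CountDownLatch(1);
        CountDownLatch timeoutHandled = new CountDownLatch(1);

        // Simulates finalizing a pending checkpoint: blocking IO under the lock.
        Thread finalizer = new Thread(() -> {
            synchronized (COORDINATOR_LOCK) {
                lockHeld.countDown();
                try {
                    dfsRecovered.await(); // simulated stuck metadata write
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");

        // Simulates the timer: the timeout event needs the same lock.
        Thread timer = new Thread(() -> {
            synchronized (COORDINATOR_LOCK) {
                timeoutHandled.countDown();
            }
        }, "Checkpoint Timer");

        try {
            finalizer.start();
            lockHeld.await(); // make sure the finalizer owns the lock first
            timer.start();

            // While the "DFS" is stuck, the timeout event cannot be executed.
            boolean ranEarly = timeoutHandled.await(200, TimeUnit.MILLISECONDS);

            dfsRecovered.countDown(); // let the simulated write finish
            finalizer.join();
            timer.join();
            return ranEarly;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        // prints "timeout ran while finalize was stuck: false"
        System.out.println("timeout ran while finalize was stuck: " + timeoutRanWhileFinalizing());
    }
}
```

The timer thread only proceeds once the simulated write completes and the lock
is released, which matches the behaviour reported above: as long as
finalization is stuck under the lock, the timeout event is never executed.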
--
This message was sent by Atlassian Jira
(v8.20.10#820010)