[jira] [Commented] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

Roman Khachatryan (Jira) Mon, 06 Mar 2023 00:46:37 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696770#comment-17696770
 ]


Roman Khachatryan commented on FLINK-31249:
-------------------------------------------

That's doable by writing metadata in a separate (IO) thread and waiting for a 
result with a timeout.
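
For illustration, a minimal sketch of that idea, assuming a plain
java.util.concurrent ExecutorService stands in for the IO thread pool. The
names TimedMetadataWrite and writeWithTimeout are hypothetical and are not
Flink APIs; the Callable is a placeholder for the actual metadata write.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedMetadataWrite {

    /** Runs writeMetadata on ioExecutor and waits at most `timeout` for its result. */
    public static <T> T writeWithTimeout(
            ExecutorService ioExecutor, Callable<T> writeMetadata, Duration timeout)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> pending = ioExecutor.submit(writeMetadata);
        try {
            return pending.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the writer thread. A write stuck in non-interruptible IO
            // may still linger, so this only unblocks the caller.
            pending.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            // Fast write: the result is returned normally.
            System.out.println(
                    writeWithTimeout(io, () -> "metadata-written", Duration.ofSeconds(1)));
            // Slow write: the caller gives up after 100 ms.
            try {
                writeWithTimeout(
                        io,
                        () -> {
                            Thread.sleep(5_000);
                            return "late";
                        },
                        Duration.ofMillis(100));
            } catch (TimeoutException e) {
                System.out.println("timed out; the checkpoint would be discarded");
            }
        } finally {
            io.shutdownNow();
        }
    }
}
```

Note that the timeout only frees the waiting thread; whatever was already
written still has to be cleaned up by the caller.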

But I'm not sure this would do more good than harm:
 * most of the work (snapshotting the tasks) is already done by this point, 
so timing out the write of the (usually small) metadata file discards the 
checkpoint and starts it over; that essentially delays the checkpoint
 * if the timeout is caused by overload, the next checkpoint is much less 
likely to succeed (because it needs to discard the written state, upload it 
again, and write the metadata again)
 * in the narrower case where the IO thread pool is overloaded but the IO 
itself is not, it would be a pure regression

So I'd avoid such a change without a real-world use case.

 

Could you elaborate on why the above proposal
{quote}Rather, specific FS implementations can be configured to time out too 
long requests.
{quote}
doesn't work in your case?

 

As for the alerts, it should also be possible to have them fire when there 
are no datapoints about recent checkpoints.

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> -------------------------------------------------------------------
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager receives ACKs from all tasks, it finalizes the pending 
> checkpoint into a completed checkpoint. Currently the JM finalizes the 
> pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while 
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread 
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} will not 
> execute the timeout event because it is blocked waiting for the lock. As a 
> result, the previous checkpoint cannot be cancelled.
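
The blocking described above can be reproduced in isolation. The following
is a rough sketch, not Flink code: coordinatorLock, the latches, and the
class name BlockedTimeoutDemo are hypothetical stand-ins, and the hung DFS
write is simulated by waiting on a latch while holding the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockedTimeoutDemo {

    /** Returns true if the timeout task was still blocked after its deadline had passed. */
    public static boolean timeoutBlockedWhileFinalizing() throws InterruptedException {
        Object coordinatorLock = new Object();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch dfsRecovered = new CountDownLatch(1);
        CountDownLatch timeoutDone = new CountDownLatch(1);
        AtomicBoolean timeoutRan = new AtomicBoolean(false);

        // Finalization takes the lock and then hangs on the simulated DFS call.
        Thread finalizer = new Thread(() -> {
            synchronized (coordinatorLock) {
                lockHeld.countDown();
                try {
                    dfsRecovered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");
        finalizer.start();
        lockHeld.await();

        // The timeout handler needs the same lock, so it cannot run yet.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.schedule(() -> {
            synchronized (coordinatorLock) {
                timeoutRan.set(true);
                timeoutDone.countDown();
            }
        }, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(300); // well past the 50 ms deadline
        boolean blocked = !timeoutRan.get();

        // Once the simulated DFS call returns, the timeout handler finally runs.
        dfsRecovered.countDown();
        timeoutDone.await();
        finalizer.join();
        timer.shutdown();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timeout blocked: " + timeoutBlockedWhileFinalizing());
    }
}
```

The timeout handler only runs after the lock holder returns, which matches
the observation that the stuck checkpoint cannot be cancelled.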



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
