[ 
https://issues.apache.org/jira/browse/FLINK-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134109#comment-17134109
 ] 

Piotr Nowojski commented on FLINK-14971:
----------------------------------------

{quote}
1. The asynchronous committing of CompletedCheckpointStore must be done first, 
then CheckpointCoordinator notifies tasks that the checkpoint is completed. 
{quote}
Yes, this would maintain the current behaviour: currently we wait 
synchronously for the checkpoint to be added to {{CompletedCheckpointStore}}.
{quote}
2. If job fails before asynchronous committing completes, CheckpointCoordinator 
needs to decide how to handle this committing. When committing completes, JM 
might be stuck in restoring or other steps (like cancelling tasks). 
(...)
Option B is to treat this checkpoint as a successful one but not notify the 
tasks; they are being cancelled or are waiting to be restarted, so a 
notification would be meaningless. I think option B is simpler and better, and 
also acceptable, because the notification of checkpoint completion is not 
guaranteed anyway.
{quote}
Yes, this is what I think we should be doing. It's the same case as we have 
currently, when the JM fails after adding the completed checkpoint to ZK but 
before sending some or all of the notifications.
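
To make the ordering concrete, here is a minimal sketch of points 1 and 2 
(option B). It is not Flink's actual implementation: the class 
CheckpointCompletionSketch, its fields and its methods are hypothetical names 
invented for illustration, and the store commit is modelled as a plain 
CompletableFuture. Only the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical sketch; none of these names exist in Flink. */
class CheckpointCompletionSketch {
    // Checkpoints whose asynchronous store commit has finished.
    final List<Long> completedCheckpoints = new ArrayList<>();
    // Checkpoints for which tasks were notified (stand-in for the real RPC).
    final List<Long> notifiedCheckpoints = new ArrayList<>();

    // All coordinator state is read and written only on this executor.
    private final Executor mainThreadExecutor;
    private boolean jobRunning = true;

    CheckpointCompletionSketch(Executor mainThreadExecutor) {
        this.mainThreadExecutor = mainThreadExecutor;
    }

    /** Assumed to be called from the main thread when the job fails. */
    void markJobFailed() {
        jobRunning = false;
    }

    /**
     * Point 1: tasks are notified only after the store commit has completed.
     * Point 2, option B: if the job failed in the meantime, the checkpoint
     * still counts as successful but the notification is skipped.
     */
    CompletableFuture<Void> completeCheckpoint(
            long checkpointId, CompletableFuture<Void> storeCommit) {
        return storeCommit.thenAcceptAsync(
                ignored -> {
                    completedCheckpoints.add(checkpointId);
                    if (jobRunning) {
                        notifiedCheckpoints.add(checkpointId);
                    }
                },
                mainThreadExecutor);
    }
}
```

Because the continuation runs on the single main thread executor, the check of 
jobRunning and the notification cannot interleave with failure handling. A 
failed commit (an exceptionally completed future) is deliberately not handled 
in this sketch.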

> Make all the non-IO operations in CheckpointCoordinator single-threaded
> -----------------------------------------------------------------------
>
>                 Key: FLINK-14971
>                 URL: https://issues.apache.org/jira/browse/FLINK-14971
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Biao Liu
>            Assignee: Biao Liu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently the handling of ACK and declined messages is executed in the IO 
> thread. This is the only remaining place where non-IO operations are 
> executed in the IO thread. It blocks introducing a main thread executor for 
> {{CheckpointCoordinator}}. It will be resolved in this task.
> After resolving the ACK and declined message issue, the main thread executor 
> will be introduced into {{CheckpointCoordinator}} to replace the timer 
> thread. However, the timer thread will be kept (perhaps only temporarily) to 
> schedule periodic triggering, since FLINK-13848 has not been accepted yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)