[ 
https://issues.apache.org/jira/browse/FLINK-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947662#comment-16947662
 ] 

Biao Liu commented on FLINK-14344:
----------------------------------

When I start implementing this part, I just realized there is one more thing 
need to discuss. I would like to make the semantics of 
{{MasterTriggerRestoreHook#triggerCheckpoint}} clearer.

At the moment {{MasterTriggerRestoreHook}} is not a public official interface 
(at-least from perspective of codes). I'm not sure how big the side-effect 
might be if I changed the interface or behavior.

The problem is that now the interface says (described in Java doc) both 
synchronous and asynchronous implementation are OK. This behavior makes things 
more complex. I can't figure out a proper asynchronous way to handle both of 
these two scenarios at the same time.

I was planning to execute {{MasterTriggerRestoreHook#triggerCheckpoint}} in IO 
executor directly. It's OK for the synchronous scenario. But I found there 
might be a deadlock scenario under asynchronous scenario. Currently the IO 
executor is given to master hook as an input parameter. If we execute it in IO 
executor, and the implementation of 
{{MasterTriggerRestoreHook#triggerCheckpoint}} launches an asynchronous 
operation then waits for the result for some reason. It might be deadlock if 
this asynchronous operation is scheduled in the same IO thread. The waiting 
blocks the later asynchronous operation, it can't finish.

IMO synchronous and asynchronous interfaces should be different, and be treated 
in different ways.
 # The synchronous invocation returns value directly, not a completable future. 
The method is executed in IO thread under a proper lock (could be the hook 
itself). It could be launched by {{CheckpointCoordinator}} directly. The 
implementation of {{MasterTriggerRestoreHook}} could not see the IO executor at 
all. However in this way, we break the compatibility.
 # The asynchronous invocation returns a completable future just like the 
current. The {{MasterTriggerRestoreHook#triggerCheckpoint}} method itself is 
without any IO operation. All heavy IO operations are executed in the IO 
executor which is given to master hook as an input parameter. There is also an 
advantage of this asynchronous interface. We could avoid competition on all 
methods of {{MasterTriggerRestoreHook}} (run in main thread) except the real 
asynchronous part (user must guarantee it is thread-safe or under a proper 
lock) executed in IO thread. In this way, we keep the compatibility on the 
surface. However we change the behavior somewhat. We could emphasize the change 
in Java doc and release note.

If nobody could make sure the side-effect. I could start a survey or a 
discussion in mailing list. What do you think? Any feedback is appreciated.
 cc [~sewen], [~trohrmann], [~pnowojski]

> Snapshot master hook state asynchronously
> -----------------------------------------
>
>                 Key: FLINK-14344
>                 URL: https://issues.apache.org/jira/browse/FLINK-14344
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Biao Liu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently we snapshot the master hook state synchronously. As a part of 
> reworking threading model of {{CheckpointCoordinator}}, we have to make this 
> non-blocking to satisfy the requirement of running in main thread.
> The behavior of snapshotting master hook state is similar to task state 
> snapshotting. Master state snapshotting is taken before task state 
> snapshotting. Because in master hook, there might be external system 
> initialization which task state snapshotting might depends on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to