[
https://issues.apache.org/jira/browse/FLINK-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947662#comment-16947662
]
Biao Liu commented on FLINK-14344:
----------------------------------
When I start implementing this part, I just realized there is one more thing
need to discuss. I would like to make the semantics of
{{MasterTriggerRestoreHook#triggerCheckpoint}} clearer.
At the moment {{MasterTriggerRestoreHook}} is not a public official interface
(at-least from perspective of codes). I'm not sure how big the side-effect
might be if I changed the interface or behavior.
The problem is that now the interface says (described in Java doc) both
synchronous and asynchronous implementation are OK. This behavior makes things
more complex. I can't figure out a proper asynchronous way to handle both of
these two scenarios at the same time.
I was planning to execute {{MasterTriggerRestoreHook#triggerCheckpoint}} in IO
executor directly. It's OK for the synchronous scenario. But I found there
might be a deadlock scenario under asynchronous scenario. Currently the IO
executor is given to master hook as an input parameter. If we execute it in IO
executor, and the implementation of
{{MasterTriggerRestoreHook#triggerCheckpoint}} launches an asynchronous
operation then waits for the result for some reason. It might be deadlock if
this asynchronous operation is scheduled in the same IO thread. The waiting
blocks the later asynchronous operation, it can't finish.
IMO synchronous and asynchronous interfaces should be different, and be treated
in different ways.
# The synchronous invocation returns value directly, not a completable future.
The method is executed in IO thread under a proper lock (could be the hook
itself). It could be launched by {{CheckpointCoordinator}} directly. The
implementation of {{MasterTriggerRestoreHook}} could not see the IO executor at
all. However in this way, we break the compatibility.
# The asynchronous invocation returns a completable future just like the
current. The {{MasterTriggerRestoreHook#triggerCheckpoint}} method itself is
without any IO operation. All heavy IO operations are executed in the IO
executor which is given to master hook as an input parameter. There is also an
advantage of this asynchronous interface. We could avoid competition on all
methods of {{MasterTriggerRestoreHook}} (run in main thread) except the real
asynchronous part (user must guarantee it is thread-safe or under a proper
lock) executed in IO thread. In this way, we keep the compatibility on the
surface. However we change the behavior somewhat. We could emphasize the change
in Java doc and release note.
If nobody could make sure the side-effect. I could start a survey or a
discussion in mailing list. What do you think? Any feedback is appreciated.
cc [~sewen], [~trohrmann], [~pnowojski]
> Snapshot master hook state asynchronously
> -----------------------------------------
>
> Key: FLINK-14344
> URL: https://issues.apache.org/jira/browse/FLINK-14344
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Reporter: Biao Liu
> Priority: Major
> Fix For: 1.10.0
>
>
> Currently we snapshot the master hook state synchronously. As a part of
> reworking threading model of {{CheckpointCoordinator}}, we have to make this
> non-blocking to satisfy the requirement of running in main thread.
> The behavior of snapshotting master hook state is similar to task state
> snapshotting. Master state snapshotting is taken before task state
> snapshotting. Because in master hook, there might be external system
> initialization which task state snapshotting might depends on.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)