[
https://issues.apache.org/jira/browse/FLINK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen closed FLINK-17781.
--------------------------------
> OperatorCoordinator Context must support calls from thread other than
> JobMaster Main Thread
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-17781
> URL: https://issues.apache.org/jira/browse/FLINK-17781
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Currently, calls on the Context in the OperatorCoordinator go directly
> synchronously to the ExcutionGraph.
> There are two critical problems are:
> - It is common that the code in the OperatorCoordinator runs in a separate
> thread (for example, because it executes blocking operations). Calling the
> scheduler from another thread causes the Scheduler to crash (Assertion Error,
> violation of single threaded property)
> - Calls on the ExecutionGraph are removed as part of removing the legacy
> scheduler. Certain calls do not work any more.
> +Problem Level 1:+
> The solution would be to pass in the scheduler and a main thread executor to
> interact with it.
> However, to do that the scheduler needs to be created before the
> OperatorCoordinators are created. One could do that by creating the
> Coordinators lazily after the Scheduler.
> +Problem Level 2:+
> The Scheduler restores the savepoints as part of the scheduler creation, when
> the ExecutionGraph and the CheckpointCoordinator are created early in the
> constructor.
> (Side note: That design is tricky in itself, because it means state is
> restored before the scheduler is even properly constructed.)
> That means the OperatorCoordinator needs to exist (or an in placeholder
> component needs to exist) to accept the restored state.
> That brings us to a cyclic dependency:
> - OperatorCoordinator (context) needs Scheduler and MainThreadExecutor
> - Scheduler and MainThreadExecutor need constructed ExecutionGraph
> - ExecutionGraph needs CheckpointCoordinator
> - CheckpointCoordinator needs OperatorCoordinator
> +Breaking the Cycle+
> The only way we can do this is with a form of lazy initialization:
> - We eagerly create the OperatorCoordinators so they exist for state restore
> - We provide an uninitialized context to them
> - When the Scheduler is started (after leadership is granted) we initialize
> the context with the (then readily constructed) Scheduler and
> MainThreadExecutor
> +Longer-term Solution+
> The longer term solution would require a major change in the Scheduler and
> CheckpointCoordinator setup. Something like this:
> - Scheduler (and ExecutionGraph) are constructed first
> - JobMaster waits for leadership
> - Upon leader grant, Operator Coordinators are constructed and can
> reference the Scheduler and FencedMainThreadExecutor
> - CheckpointCoordinator is constructed and references ExecutionGraph and
> OperatorCoordinators
> - Savepoint or latest checkpoint is restored
> The implementation of the current should try to couple parts as loosely as
> possible to make it easy to implement the above approach later.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)