[
https://issues.apache.org/jira/browse/FLINK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871499#comment-15871499
]
ASF GitHub Bot commented on FLINK-5703:
---------------------------------------
GitHub user wangzhijiang999 opened a pull request:
https://github.com/apache/flink/pull/3340
[FLINK-5703][runtime]Job manager failure recovery via reconciliation with
TaskManager reports
This is part of [Non-disruptive JobManager Failures via Reconciliation
](https://issues.apache.org/jira/browse/FLINK-4911).
The design doc for this part is attached in the JIRA and it mainly contains
the following work:
- **RECONCILING** state transition for job status and execution.
- Recovery of the **ExecutionGraph**, **ExecutionVertex**, and
**Execution** data structures based on **TaskManager** reports.
Two parts are left for the whole feature:
- The related modifications on the **TaskManager** side will be submitted in
another JIRA. They include not canceling tasks when the TaskManager is notified
of a job leader change, and reporting task status to the **JobManager**.
- This PR has no effect on the current master branch, so it is
safe to merge. The reconcile logic should be triggered by **JobManagerRunner**
when it is granted leadership, but that depends on [Determine whether the job
starts from last JobManager
failure](https://issues.apache.org/jira/browse/FLINK-5501), so I will modify
the **JobManagerRunner** logic after
[FLINK-5501](https://issues.apache.org/jira/browse/FLINK-5501) is merged.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangzhijiang999/flink FLINK-5703
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3340.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3340
----
commit 7337754bd8d12808a328b55e184b0fe59cf6e11d
Author: 淘江 <[email protected]>
Date: 2017-02-17T08:41:44Z
[FLINK-5703][runtime]Job manager failure recovery based on reconciliation
with TaskManager reports
----
> ExecutionGraph recovery based on reconciliation with TaskManager reports
> ------------------------------------------------------------------------
>
> Key: FLINK-5703
> URL: https://issues.apache.org/jira/browse/FLINK-5703
> Project: Flink
> Issue Type: Sub-task
> Components: Distributed Coordination, JobManager
> Reporter: Zhijiang Wang
> Assignee: Zhijiang Wang
>
> The ExecutionGraph structure is recovered from TaskManager reports
> during the reconciling period. The necessary information includes:
> - Execution: ExecutionAttemptID, AttemptNumber, StartTimestamp,
> ExecutionState, SimpleSlot, PartialInputChannelDeploymentDescriptor(Consumer
> Execution)
> - ExecutionVertex: Map<IntermediateResultPartitionID,
> IntermediateResultPartition>
> - ExecutionGraph: ConcurrentHashMap<ExecutionAttemptID, Execution>
> An execution in the {{RECONCILING}} ExecutionState should transition into one
> of the existing task states ({{RUNNING}}, {{CANCELED}}, {{FAILED}},
> {{FINISHED}}). To allow this, the TaskManager should retain the terminal task
> state ({{CANCELED}}, {{FAILED}}, {{FINISHED}}) for a while; we will implement
> this mechanism in another JIRA. In addition, each state transition triggers
> different actions, and some actions rely on the necessary information above.
> Given this constraint, the recovery process is divided into two steps:
> - First, recover all the necessary information except ExecutionState.
> - Second, transition ExecutionState into the real task state and trigger
> the actions. The behavior is the same as the current {{UpdateTaskExecutionState}}.
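The two-step ordering above can be sketched as follows. This is only an illustration under stated assumptions: `TaskReport`, `RecoveredExecution`, and `recover` are hypothetical names that do not come from the Flink code base, attempt IDs are plain strings, and the "action" triggered by a transition is reduced to a log entry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoStepRecoverySketch {
    enum ExecutionState { RECONCILING, RUNNING, CANCELED, FAILED, FINISHED }

    /** Hypothetical shape of what one TaskManager reports for one task. */
    static final class TaskReport {
        final String attemptId;
        final int attemptNumber;
        final long startTimestamp;
        final ExecutionState reportedState;

        TaskReport(String attemptId, int attemptNumber, long startTimestamp,
                   ExecutionState reportedState) {
            this.attemptId = attemptId;
            this.attemptNumber = attemptNumber;
            this.startTimestamp = startTimestamp;
            this.reportedState = reportedState;
        }
    }

    /** Hypothetical recovered execution; it starts in RECONCILING. */
    static final class RecoveredExecution {
        final int attemptNumber;
        final long startTimestamp;
        ExecutionState state = ExecutionState.RECONCILING;

        RecoveredExecution(int attemptNumber, long startTimestamp) {
            this.attemptNumber = attemptNumber;
            this.startTimestamp = startTimestamp;
        }
    }

    static Map<String, RecoveredExecution> recover(List<TaskReport> reports,
                                                   List<String> actionLog) {
        Map<String, RecoveredExecution> executions = new LinkedHashMap<>();
        // Step 1: restore everything except the ExecutionState.
        for (TaskReport r : reports) {
            executions.put(r.attemptId,
                    new RecoveredExecution(r.attemptNumber, r.startTimestamp));
        }
        // Step 2: all data is in place, so transitions can now trigger actions.
        for (TaskReport r : reports) {
            RecoveredExecution e = executions.get(r.attemptId);
            e.state = r.reportedState;
            actionLog.add(r.attemptId + " -> " + r.reportedState);
        }
        return executions;
    }
}
```

Running both loops over the full report list, rather than one combined loop, is what guarantees that an action triggered in step 2 can already see every other recovered execution.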
> To keep the logic simple and consistent, during the recovery period all other
> RPC messages ({{UpdateTaskExecutionState}}, {{ScheduleOrUpdateConsumers}}, etc.)
> from the TaskManager should be refused temporarily and answered with a special
> message by the JobMaster. The TaskManager should then retry sending these
> messages until the JobManager finishes recovery and acknowledges them.
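One way to picture the refusal behavior is a simple gate in front of the message handlers. The sketch below is an assumption-laden illustration, not the JobMaster implementation: `ReconcilingGateSketch`, `Response`, `onTaskManagerMessage`, and `finishReconciling` are hypothetical names, and the "special message" is modeled as a `RETRY_LATER` enum value.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ReconcilingGateSketch {
    /** Hypothetical reply type; RETRY_LATER stands in for the special message. */
    enum Response { ACCEPTED, RETRY_LATER }

    // True while the JobManager is still reconciling.
    private final AtomicBoolean reconciling = new AtomicBoolean(true);

    /** Runs the handler only once reconciliation is over; otherwise asks for a retry. */
    Response onTaskManagerMessage(Runnable handler) {
        if (reconciling.get()) {
            return Response.RETRY_LATER;
        }
        handler.run();
        return Response.ACCEPTED;
    }

    /** Called when recovery ends; later messages are handled normally. */
    void finishReconciling() {
        reconciling.set(false);
    }
}
```

A sender that receives `RETRY_LATER` would resend the same message after some delay; the retry policy itself is left out of the sketch.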
> A job in the {{RECONCILING}} JobStatus transitions into one of the states
> ({{RUNNING}}, {{FAILING}}, {{FINISHED}}) after recovery.
> - {{RECONCILING}} to {{RUNNING}}: All the TaskManagers report within
> the allowed duration and all the tasks are in the {{RUNNING}} state.
> - {{RECONCILING}} to {{FAILING}}: One of the TaskManagers does not report
> in time, or one of the tasks is in the {{FAILED}} or {{CANCELED}} state.
> - {{RECONCILING}} to {{FINISHED}}: All the TaskManagers report within
> the allowed duration and all the tasks are in the {{FINISHED}} state.
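The three JobStatus rules above can be written as a small decision function. This is a sketch under assumptions: `ReconcileJobStatusSketch` and `resolve` are hypothetical names, and the enums are local stand-ins rather than Flink's own classes. The description does not say what happens when tasks are a mix of {{RUNNING}} and {{FINISHED}}, so the sketch leaves the job in {{RECONCILING}} in that case instead of guessing.

```java
import java.util.Collection;

final class ReconcileJobStatusSketch {
    enum TaskState { RUNNING, CANCELED, FAILED, FINISHED }
    enum JobStatus { RECONCILING, RUNNING, FAILING, FINISHED }

    static JobStatus resolve(boolean allReportedInTime, Collection<TaskState> states) {
        // A missing report, or any failed or canceled task, fails the job.
        if (!allReportedInTime
                || states.contains(TaskState.FAILED)
                || states.contains(TaskState.CANCELED)) {
            return JobStatus.FAILING;
        }
        if (!states.isEmpty() && states.stream().allMatch(s -> s == TaskState.FINISHED)) {
            return JobStatus.FINISHED;
        }
        if (!states.isEmpty() && states.stream().allMatch(s -> s == TaskState.RUNNING)) {
            return JobStatus.RUNNING;
        }
        // Assumption: cases the description does not cover stay undecided.
        return JobStatus.RECONCILING;
    }
}
```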
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)