[ 
https://issues.apache.org/jira/browse/FLINK-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871499#comment-15871499
 ] 

ASF GitHub Bot commented on FLINK-5703:
---------------------------------------

GitHub user wangzhijiang999 opened a pull request:

    https://github.com/apache/flink/pull/3340

    [FLINK-5703][runtime]Job manager failure recovery via reconciliation with 
TaskManager reports

    This is part of [Non-disruptive JobManager Failures via 
Reconciliation](https://issues.apache.org/jira/browse/FLINK-4911).
    
    The design doc for this part is attached to the JIRA issue. It mainly 
covers the following work:
    
    - **RECONCILING** state transition for job status and execution.
    - Data structure recovery for **ExecutionGraph**, **ExecutionVertex**, 
and **Execution** based on **TaskManager** reports.
    
    Two parts are left for the whole feature:
    
    - The related modifications on the **TaskManager** side will be submitted 
in another JIRA. They include not cancelling the tasks when notified that the 
job leader changed, and reporting the task status to the **JobManager**.
    - This PR does not affect the current master branch, so it is safe to 
merge. The reconcile logic should be triggered by **JobManagerRunner** when it 
is granted leadership, but that depends on [Determine whether the job starts 
from last JobManager 
failure](https://issues.apache.org/jira/browse/FLINK-5501), so I will modify 
the **JobManagerRunner** logic after 
[FLINK-5501](https://issues.apache.org/jira/browse/FLINK-5501) is merged.
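
To illustrate the **RECONCILING** state transition for an execution, here is a minimal Java sketch. It is not the code in this PR: the class `ReconcilingExecution`, its methods, and the use of a `String` in place of `ExecutionAttemptID` are all hypothetical. It assumes the two-step order given in the issue description (restore the data first, then adopt the reported task state) and the four target states listed there.

```java
import java.util.EnumSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch of a recovered execution; names are not Flink's API. */
final class ReconcilingExecution {

    enum State { RECONCILING, RUNNING, CANCELED, FAILED, FINISHED }

    // States a RECONCILING execution may move to, per the issue description.
    private static final Set<State> TARGETS =
            EnumSet.of(State.RUNNING, State.CANCELED, State.FAILED, State.FINISHED);

    private State state = State.RECONCILING;
    private String attemptId;      // stands in for ExecutionAttemptID (assumption)
    private int attemptNumber;
    private long startTimestamp;
    private boolean restored;

    /** Step 1: restore everything except the execution state. */
    void restore(String attemptId, int attemptNumber, long startTimestamp) {
        this.attemptId = Objects.requireNonNull(attemptId);
        this.attemptNumber = attemptNumber;
        this.startTimestamp = startTimestamp;
        this.restored = true;
    }

    /**
     * Step 2: adopt the task state reported by the TaskManager.
     * Returns false if the execution already left RECONCILING or the target is not allowed.
     */
    boolean transitionTo(State reported) {
        if (!restored) {
            throw new IllegalStateException("restore the execution data before its state");
        }
        if (state != State.RECONCILING || !TARGETS.contains(reported)) {
            return false;
        }
        state = reported;
        return true;
    }

    State getState() {
        return state;
    }
}
```

Enforcing the order inside `transitionTo` is one possible design choice: actions triggered by the state change can then rely on the restored data being present.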

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangzhijiang999/flink FLINK-5703

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3340.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3340
    
----
commit 7337754bd8d12808a328b55e184b0fe59cf6e11d
Author: 淘江 <[email protected]>
Date:   2017-02-17T08:41:44Z

    [FLINK-5703][runtime]Job manager failure recovery based on reconciliation 
with TaskManager reports

----


> ExecutionGraph recovery based on reconciliation with TaskManager reports
> ------------------------------------------------------------------------
>
>                 Key: FLINK-5703
>                 URL: https://issues.apache.org/jira/browse/FLINK-5703
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination, JobManager
>            Reporter: Zhijiang Wang
>            Assignee: Zhijiang Wang
>
> The ExecutionGraph structure would be recovered from TaskManager reports 
> during the reconciling period. The necessary information includes:
>     - Execution: ExecutionAttemptID, AttemptNumber, StartTimestamp, 
> ExecutionState, SimpleSlot, PartialInputChannelDeploymentDescriptor (Consumer 
> Execution)
>     - ExecutionVertex: Map<IntermediateResultPartitionID, 
> IntermediateResultPartition>
>     - ExecutionGraph: ConcurrentHashMap<ExecutionAttemptID, Execution>
> An execution in the {{RECONCILING}} ExecutionState should transition into one 
> of the existing task states ({{RUNNING}}, {{CANCELED}}, {{FAILED}}, 
> {{FINISHED}}). To do so, the TaskManager should retain the terminal task state 
> ({{CANCELED}}, {{FAILED}}, {{FINISHED}}) for a while; we will try to implement 
> this mechanism in another JIRA. In addition, the state transition would 
> trigger different actions, and some actions rely on the necessary information 
> above. Given this constraint, the recovery process will be divided into two 
> steps:
>     - First, recover all necessary information except ExecutionState.
>     - Second, transition ExecutionState into the real task state and trigger 
> actions. The behavior is the same as the current {{UpdateTaskExecutionState}}.
> To keep the logic simple and consistent, during the recovery period all other 
> RPC messages ({{UpdateTaskExecutionState}}, {{ScheduleOrUpdateConsumers}}, 
> etc.) from the TaskManager should be refused temporarily and answered with a 
> special message by the JobMaster. The TaskManager should then retry sending 
> these messages until the JobManager finishes recovery and acknowledges them.
> A job in the {{RECONCILING}} JobStatus would transition into one of the states 
> ({{RUNNING}}, {{FAILING}}, {{FINISHED}}) after recovery.
>     - {{RECONCILING}} to {{RUNNING}}: All TaskManagers report within the 
> allowed duration and all tasks are in the {{RUNNING}} state.
>     - {{RECONCILING}} to {{FAILING}}: One of the TaskManagers does not report 
> in time, or one of the tasks is in the {{FAILED}} or {{CANCELED}} state.
>     - {{RECONCILING}} to {{FINISHED}}: All TaskManagers report within the 
> allowed duration and all tasks are in the {{FINISHED}} state.
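
The three JobStatus transitions above can be sketched as a single decision function. This is only an illustration under stated assumptions: `JobStatusReconciler` and its enums are hypothetical names, not Flink classes. The issue does not say what happens when some tasks are {{RUNNING}} and others {{FINISHED}}; the sketch treats that mix as {{RUNNING}}, which is an assumption. An empty report set is rejected because the issue does not define it.

```java
import java.util.Collection;

/** Hypothetical sketch; not Flink's actual classes. */
final class JobStatusReconciler {

    enum TaskState { RUNNING, CANCELED, FAILED, FINISHED }

    enum JobStatus { RECONCILING, RUNNING, FAILING, FINISHED }

    /**
     * Decides the JobStatus that follows RECONCILING.
     *
     * @param allTaskManagersReported false if any TaskManager missed the reporting deadline
     * @param reportedStates task states collected from the TaskManager reports
     */
    static JobStatus resolve(boolean allTaskManagersReported,
                             Collection<TaskState> reportedStates) {
        if (!allTaskManagersReported) {
            return JobStatus.FAILING; // a TaskManager did not report in time
        }
        if (reportedStates.isEmpty()) {
            // Not covered by the issue description, so the sketch refuses to guess.
            throw new IllegalArgumentException("no task states reported");
        }
        boolean allFinished = true;
        for (TaskState state : reportedStates) {
            if (state == TaskState.FAILED || state == TaskState.CANCELED) {
                return JobStatus.FAILING; // one failed or canceled task is enough
            }
            if (state != TaskState.FINISHED) {
                allFinished = false;
            }
        }
        // Assumption: a mix of RUNNING and FINISHED tasks counts as RUNNING.
        return allFinished ? JobStatus.FINISHED : JobStatus.RUNNING;
    }
}
```

For example, `resolve(true, Arrays.asList(TaskState.RUNNING, TaskState.FAILED))` yields `FAILING`, and any call with `allTaskManagersReported == false` yields `FAILING` regardless of the reported states.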

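The refuse-and-retry handling of RPC messages during recovery, as described in the issue, could look roughly like the following Java sketch. Everything here is hypothetical: `ReconcilingRpcGate`, the `Response` values, and `sendWithRetry` are illustration names only, and a plain `Runnable` stands in for messages such as {{UpdateTaskExecutionState}}. A real sender would also wait between attempts, which the sketch leaves out.

```java
/** Hypothetical sketch of refusing RPCs while the JobMaster reconciles. */
final class ReconcilingRpcGate {

    /** RETRY_LATER plays the role of the "special message" from the issue. */
    enum Response { ACKNOWLEDGED, RETRY_LATER }

    // True while the JobMaster is still rebuilding its state from reports.
    private volatile boolean reconciling = true;

    /** Called once recovery has ended. */
    void finishReconciliation() {
        reconciling = false;
    }

    /** JobMaster side: run the handler only after recovery, otherwise refuse. */
    Response handle(Runnable handler) {
        if (reconciling) {
            return Response.RETRY_LATER;
        }
        handler.run();
        return Response.ACKNOWLEDGED;
    }

    /**
     * TaskManager side: resend until acknowledged or attempts are exhausted.
     * Returns true if the message was acknowledged.
     */
    static boolean sendWithRetry(ReconcilingRpcGate jobMaster, Runnable message, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (jobMaster.handle(message) == Response.ACKNOWLEDGED) {
                return true;
            }
            // A real sender would back off here before the next attempt.
        }
        return false;
    }
}
```

With this shape, a message sent before `finishReconciliation()` is never applied to half-recovered state; it is simply replayed once the gate opens.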


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
