[ 
https://issues.apache.org/jira/browse/FLINK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318827#comment-16318827
 ] 

ASF GitHub Bot commented on FLINK-4816:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    I think this approach is not yet sufficient. A failure in DEPLOY can have 
various causes; a failed checkpoint restore is only one of them.
    
    This also couples the execution graph state to the checkpoint 
coordinator (via the last restored checkpoint ID), which breaks the design 
and the separation of responsibilities.
    
    A proper solution here is probably a bit more comprehensive and needs a 
bit more thinking, probably a bigger design document. My first thought would be 
to report a proper RestoreException from the TaskManager, keep a history of 
exceptions that triggered recovery, use that to evaluate fallback, etc.
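
    One possible reading of that suggestion, purely as an illustrative sketch 
and not a design: the names RecoveryHistory, shouldFallBackFrom and 
maxRestoreAttempts are hypothetical and not part of Flink, and the 
RestoreException shown here is only a stand-in for whatever the TaskManager 
would actually report.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical stand-in for an exception reported by the TaskManager
// when restoring state from a checkpoint fails.
class RestoreException extends Exception {
    private final long checkpointId;

    RestoreException(long checkpointId, Throwable cause) {
        super("Failed to restore checkpoint " + checkpointId, cause);
        this.checkpointId = checkpointId;
    }

    long getCheckpointId() {
        return checkpointId;
    }
}

// Hypothetical history of the failures that triggered recovery.
class RecoveryHistory {
    private final Deque<Throwable> failures = new ArrayDeque<>();
    private final int maxRestoreAttempts;

    RecoveryHistory(int maxRestoreAttempts) {
        this.maxRestoreAttempts = maxRestoreAttempts;
    }

    void record(Throwable failure) {
        failures.addLast(failure);
    }

    // Returns true when the most recent failures are at least
    // maxRestoreAttempts consecutive restore failures of the given checkpoint.
    boolean shouldFallBackFrom(long checkpointId) {
        int consecutive = 0;
        Iterator<Throwable> newestFirst = failures.descendingIterator();
        while (newestFirst.hasNext()) {
            Throwable t = newestFirst.next();
            boolean sameRestoreFailure = t instanceof RestoreException
                    && ((RestoreException) t).getCheckpointId() == checkpointId;
            if (!sameRestoreFailure) {
                break;
            }
            consecutive++;
        }
        return consecutive >= maxRestoreAttempts;
    }
}
```

    Keeping the history separate from the exception type means the fallback 
decision depends only on what was reported, not on state shared between the 
execution graph and the checkpoint coordinator.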


> Executions failed from "DEPLOYING" should retain restored checkpoint 
> information
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-4816
>                 URL: https://issues.apache.org/jira/browse/FLINK-4816
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>            Reporter: Stephan Ewen
>            Assignee: Wei-Che Wei
>
> When an execution fails from state {{DEPLOYING}}, it should wrap the failure 
> to better report the failure cause:
>   - If no checkpoint was restored, it should wrap the exception in a 
> {{DeployTaskException}}
>   - If a checkpoint was restored, it should wrap the exception in a 
> {{RestoreTaskException}} and record the ID of the checkpoint whose restore 
> was attempted.
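
The wrapping rule described in the issue could be sketched roughly as follows. 
This is a minimal illustration only: the exception class names come from the 
issue text, but their constructors, the DeployFailureWrapper helper, and the 
use of a null checkpoint ID to mean "no checkpoint was restored" are all 
assumptions, not Flink's actual API.

```java
// Wraps a failure that happened in DEPLOYING when no checkpoint was restored.
class DeployTaskException extends Exception {
    DeployTaskException(Throwable cause) {
        super("Task failed while deploying", cause);
    }
}

// Wraps a failure that happened in DEPLOYING after a checkpoint restore
// was attempted, and records which checkpoint that was.
class RestoreTaskException extends Exception {
    private final long checkpointId;

    RestoreTaskException(long checkpointId, Throwable cause) {
        super("Task failed while restoring checkpoint " + checkpointId, cause);
        this.checkpointId = checkpointId;
    }

    long getCheckpointId() {
        return checkpointId;
    }
}

// Hypothetical helper that picks the wrapper type.
final class DeployFailureWrapper {
    private DeployFailureWrapper() {}

    // restoredCheckpointId is null when no checkpoint was restored
    // (an assumption of this sketch).
    static Exception wrap(Throwable cause, Long restoredCheckpointId) {
        if (restoredCheckpointId == null) {
            return new DeployTaskException(cause);
        }
        return new RestoreTaskException(restoredCheckpointId, cause);
    }
}
```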



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
