[ 
https://issues.apache.org/jira/browse/FLINK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319632#comment-16319632
 ] 

ASF GitHub Bot commented on FLINK-4816:
---------------------------------------

Github user tony810430 commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    Hi @StephanEwen 
    
    Let me summarize your comment and clarify a few questions I have.
    1. The original design treated every failure in DEPLOYING as a restore
    failure. That is not accurate, because a failed restore is only one of
    the possible causes.
    2. Using `last restored checkpoint ID` to record the latest ID is not the
    right approach. Maybe I need to put it in the state object. Am I right?
    3. A better solution might be to track all failures in the TaskManager
    and report only those related to restoring as restore failures, then wrap
    each one with the current checkpoint ID and send it back to the JobManager.
    
    Did I misunderstand anything? Or is there anything else that I didn't
    mention?


> Executions failed from "DEPLOYING" should retain restored checkpoint 
> information
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-4816
>                 URL: https://issues.apache.org/jira/browse/FLINK-4816
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>            Reporter: Stephan Ewen
>            Assignee: Wei-Che Wei
>
> When an execution fails from state {{DEPLOYING}}, it should wrap the failure 
> to better report the failure cause:
>   - If no checkpoint was restored, it should wrap the exception in a 
> {{DeployTaskException}}
>   - If a checkpoint was restored, it should wrap the exception in a 
> {{RestoreTaskException}} and record the ID of the checkpoint whose restore 
> was attempted.
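
The wrapping rule in the issue description could be sketched roughly as follows. This is only an illustration, not code from the Flink code base: the two exception names come from the issue, but their constructors, the `getCheckpointId` accessor, the `DeployFailureWrapping` class and its `wrap` helper are all hypothetical. It also assumes that the presence of a restored checkpoint ID is enough to pick the wrapper, which is exactly the assumption questioned in the comment above.

```java
import java.util.OptionalLong;

/** Hypothetical sketch; none of these signatures are taken from Flink. */
public class DeployFailureWrapping {

    /** Failure during deployment when no checkpoint was being restored. */
    static class DeployTaskException extends Exception {
        DeployTaskException(Throwable cause) {
            super("Task failed while deploying", cause);
        }
    }

    /** Failure during deployment while a checkpoint was being restored. */
    static class RestoreTaskException extends Exception {
        private final long checkpointId;

        RestoreTaskException(long checkpointId, Throwable cause) {
            super("Task failed while restoring checkpoint " + checkpointId, cause);
            this.checkpointId = checkpointId;
        }

        /** ID of the checkpoint whose restore was attempted. */
        long getCheckpointId() {
            return checkpointId;
        }
    }

    /**
     * Wraps a failure that occurred in DEPLOYING. An empty
     * restoredCheckpointId means no checkpoint restore was attempted.
     */
    static Exception wrap(Throwable cause, OptionalLong restoredCheckpointId) {
        if (restoredCheckpointId.isPresent()) {
            return new RestoreTaskException(restoredCheckpointId.getAsLong(), cause);
        }
        return new DeployTaskException(cause);
    }
}
```

With this sketch, `wrap(cause, OptionalLong.of(42L))` would yield a `RestoreTaskException` carrying checkpoint ID 42, while `wrap(cause, OptionalLong.empty())` would yield a `DeployTaskException`. In both cases the original failure stays available through `getCause()`.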



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
