[ https://issues.apache.org/jira/browse/FLINK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319632#comment-16319632 ]
ASF GitHub Bot commented on FLINK-4816:
---------------------------------------

Github user tony810430 commented on the issue:

    https://github.com/apache/flink/pull/4828

    Hi @StephanEwen

    Let me summarize your comment and clarify some questions I have.

    1. The original design treated all failures in DEPLOYING as restore failures. That is not accurate, because a restore failure is only one of the possible causes.
    2. Using `last restored checkpoint ID` to record the latest id is not a proper approach. Maybe I need to put it in the state object. Am I right?
    3. A better solution might be to track all failures in the TaskManager and report only the failures related to restore as restore failures, then wrap each of them with the current checkpoint id and send it back to the JobManager.

    Did I misunderstand something? Or is there anything else that I didn't mention?


> Executions failed from "DEPLOYING" should retain restored checkpoint
> information
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-4816
>                 URL: https://issues.apache.org/jira/browse/FLINK-4816
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>            Reporter: Stephan Ewen
>            Assignee: Wei-Che Wei
>
> When an execution fails from state {{DEPLOYING}}, it should wrap the failure
> to better report the failure cause:
>   - If no checkpoint was restored, it should wrap the exception in a
> {{DeployTaskException}}
>   - If a checkpoint was restored, it should wrap the exception in a
> {{RestoreTaskException}} and record the id of the checkpoint that was
> attempted to be restored.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
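
For readers following the thread, the wrapping rule in the issue description could be sketched roughly as below. This is only an illustration under assumptions: the exception names come from the issue text, but their constructors, the `getCheckpointId()` accessor, the `wrap` helper, and the enclosing `DeployFailureWrapping` class are hypothetical and are not taken from Flink or from the pull request.

```java
import java.util.OptionalLong;

/** Sketch of the wrapping described in FLINK-4816; all shapes here are assumptions, not Flink's API. */
class DeployFailureWrapping {

    /** Failure in DEPLOYING when no checkpoint restore was attempted (hypothetical shape). */
    static class DeployTaskException extends Exception {
        DeployTaskException(Throwable cause) {
            super("Task failed while deploying", cause);
        }
    }

    /** Failure in DEPLOYING while restoring a checkpoint; records the checkpoint id (hypothetical shape). */
    static class RestoreTaskException extends Exception {
        private final long checkpointId;

        RestoreTaskException(long checkpointId, Throwable cause) {
            super("Task failed while restoring checkpoint " + checkpointId, cause);
            this.checkpointId = checkpointId;
        }

        long getCheckpointId() {
            return checkpointId;
        }
    }

    /** Hypothetical helper: chooses the wrapper depending on whether a restore was attempted. */
    static Exception wrap(Throwable cause, OptionalLong restoredCheckpointId) {
        if (restoredCheckpointId.isPresent()) {
            return new RestoreTaskException(restoredCheckpointId.getAsLong(), cause);
        }
        return new DeployTaskException(cause);
    }

    public static void main(String[] args) {
        Throwable cause = new IllegalStateException("state handle not found");
        // Print which wrapper is chosen without and with a restored checkpoint id.
        System.out.println(wrap(cause, OptionalLong.empty()).getClass().getSimpleName());
        System.out.println(wrap(cause, OptionalLong.of(42L)).getClass().getSimpleName());
    }
}
```

Passing the checkpoint id as an `OptionalLong` is one way to keep the "no checkpoint restored" case explicit. Where that id should live (for example in the state object, as raised in point 2 of the comment) is left open here.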