Github user tony810430 commented on the issue:

https://github.com/apache/flink/pull/4828

Hi @StephanEwen

Let me summarize your comments and clarify a few questions I have.

1. The original design treated every failure in DEPLOY as a restore failure. That is not accurate, because a restore failure is only one of the possible causes.
2. Using `last restored checkpoint ID` to record the latest ID is not the right approach. Maybe I need to put it in the state object instead. Is that right?
3. A better solution might be to track all failures in the TaskManager and report only the failures related to restore as restore failures, then wrap each one with the current checkpoint ID and send it back to the JobManager.

Did I misunderstand anything? Or is there anything else that I didn't mention?
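
To make point 3 more concrete, here is a minimal sketch of the idea, not the actual Flink code. `RestoreFailureException`, `restoreState`, and `findFailedRestoreCheckpointId` are hypothetical names made up for illustration; nothing here is an existing Flink API. The assumption is that the TaskManager side wraps only the failures thrown by the restore step, and the JobManager side inspects the cause chain of a reported failure to decide whether it was a restore failure and for which checkpoint.

```java
import java.util.Optional;

class RestoreFailureSketch {

    /** Hypothetical exception that tags a failure with the checkpoint being restored. */
    public static final class RestoreFailureException extends Exception {
        private final long checkpointId;

        public RestoreFailureException(long checkpointId, Throwable cause) {
            super("Failed to restore from checkpoint " + checkpointId, cause);
            this.checkpointId = checkpointId;
        }

        public long getCheckpointId() {
            return checkpointId;
        }
    }

    /** TaskManager side (assumed): run the restore step and wrap only its failures. */
    public static void restoreState(long checkpointId, Runnable restoreAction)
            throws RestoreFailureException {
        try {
            restoreAction.run();
        } catch (RuntimeException e) {
            throw new RestoreFailureException(checkpointId, e);
        }
    }

    /**
     * JobManager side (assumed): walk the cause chain of a reported failure and
     * return the checkpoint ID only if the failure was a restore failure.
     */
    public static Optional<Long> findFailedRestoreCheckpointId(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof RestoreFailureException) {
                return Optional.of(((RestoreFailureException) t).getCheckpointId());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        try {
            restoreState(42L, () -> {
                throw new IllegalStateException("corrupt state handle");
            });
        } catch (RestoreFailureException e) {
            // A restore failure carries its checkpoint ID back to the caller.
            System.out.println(findFailedRestoreCheckpointId(e));
        }
        // Any other failure during deployment is not counted as a restore failure.
        System.out.println(findFailedRestoreCheckpointId(new RuntimeException("no slots")));
    }
}
```

With this shape, a failure in DEPLOY that is unrelated to restore simply carries no checkpoint ID, so it would not be counted against any checkpoint.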
---