Github user tony810430 commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    Hi @StephanEwen 
    
    Let me summarize your comment and clarify a few questions I have.
    1. The original design treated every failure in DEPLOY as a restore failure. 
That is not accurate, because a failed restore is only one of the possible causes.
    2. Using `last restored checkpoint ID` to record the latest id is not a proper 
approach. Maybe I need to put it in the state object. Am I right?
    3. A better solution might be to track all failures in the TaskManager and 
report only those related to restore as restore failures, then wrap each one 
with the current checkpoint id and send it back to the JobManager.
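    
    To make point 3 more concrete, here is a rough sketch of the idea. The class 
and method names (`RestoreFailure`, `RestoreFailureTracker`, `runRestore`, 
`failedCheckpointId`) are hypothetical and are not existing Flink classes. They 
only illustrate wrapping a restore-related failure with its checkpoint id so 
that the receiving side can tell it apart from other deployment failures.

```java
import java.util.Optional;

/** Hypothetical wrapper: marks a failure as restore-related and carries the checkpoint id. */
class RestoreFailure extends Exception {
    private final long checkpointId;

    RestoreFailure(long checkpointId, Throwable cause) {
        super("Restore from checkpoint " + checkpointId + " failed", cause);
        this.checkpointId = checkpointId;
    }

    long getCheckpointId() {
        return checkpointId;
    }
}

/** Hypothetical helper, assumed to live on the TaskManager side. */
class RestoreFailureTracker {

    /** Runs the restore step and wraps any runtime failure with the checkpoint id. */
    static void runRestore(long checkpointId, Runnable restoreStep) throws RestoreFailure {
        try {
            restoreStep.run();
        } catch (RuntimeException e) {
            throw new RestoreFailure(checkpointId, e);
        }
    }

    /**
     * Walks the cause chain of a reported failure. Returns the checkpoint id if
     * the failure is restore-related, or empty for any other kind of failure.
     */
    static Optional<Long> failedCheckpointId(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof RestoreFailure) {
                return Optional.of(((RestoreFailure) t).getCheckpointId());
            }
        }
        return Optional.empty();
    }
}
```

    With something like this, the JobManager side would count a failure as a 
restore failure only when `failedCheckpointId` returns a value, and would treat 
every other failure during deployment as an ordinary task failure.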
    
    Did I misunderstand anything? Or is there anything else that I didn't 
mention?

