Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    I think this approach is not yet sufficient. A failure in DEPLOY can have 
various causes; a failed checkpoint restore is only one of them.
    
    This also couples the execution graph state to the checkpoint coordinator 
(via the last restored checkpoint ID), which breaks the design and the 
separation of responsibilities.
    
    A proper solution here is probably a bit more comprehensive and needs a 
bit more thinking, probably a bigger design document. My first thought would be 
to report a proper RestoreException from the TaskManager, keep a history of the 
exceptions that triggered recovery, use that to evaluate fallback, etc.
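
    To make that direction a bit more concrete, here is a rough sketch only. 
Nothing in it is existing Flink API: the RestoreException shape, the 
RecoveryHistory class, and the maxAttempts threshold are all hypothetical names 
chosen for illustration, and a real design would still need the design document 
mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in Flink under these names.

/** Hypothetical exception a TaskManager could report when restoring state fails. */
class RestoreException extends Exception {
    private final long checkpointId;

    RestoreException(long checkpointId, Throwable cause) {
        super("Failed to restore from checkpoint " + checkpointId, cause);
        this.checkpointId = checkpointId;
    }

    long getCheckpointId() {
        return checkpointId;
    }
}

/** Hypothetical history of the exceptions that triggered recovery, oldest first. */
class RecoveryHistory {
    private final List<Throwable> failures = new ArrayList<>();

    void record(Throwable failure) {
        failures.add(failure);
    }

    /** Counts the most recent consecutive failures caused by restoring the given checkpoint. */
    int consecutiveRestoreFailures(long checkpointId) {
        int count = 0;
        for (int i = failures.size() - 1; i >= 0; i--) {
            Throwable t = failures.get(i);
            if (t instanceof RestoreException
                    && ((RestoreException) t).getCheckpointId() == checkpointId) {
                count++;
            } else {
                // Any other failure cause ends the streak.
                break;
            }
        }
        return count;
    }

    /** Assumed policy: fall back once the same checkpoint failed to restore maxAttempts times in a row. */
    boolean shouldFallBack(long checkpointId, int maxAttempts) {
        return consecutiveRestoreFailures(checkpointId) >= maxAttempts;
    }
}
```

    The point of the sketch is that the fallback decision reads only the 
reported exceptions, so the execution graph would not need to track the last 
restored checkpoint ID itself.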

