Github user uce commented on the issue:

    https://github.com/apache/flink/pull/2096
  
    Really good changes and great that you added the `JobStatus` graph 
transitions figure to the docs.
    
    I looked at all changes and tests. Furthermore, I've tested this with lost 
ZooKeeper connections and it works as expected. The job is suspended and 
recovered smoothly. 👍 
    
    Some general notes:
    - The only thing that might be slightly confusing for users is that until 
the web frontend is notified about the leader changes, it shows the job status 
as FAILED until it disappears. This is probably some web interface quirk, 
because it does not recognize the new job status. I understand that you remove 
suspended job graphs and don't archive them (which makes sense as they are 
recovered later), but giving that it takes some time until all information 
propagates to the web frontend, we might want to temporarily show the state as 
suspended. We can do this as a follow up.
    
    - Regarding the removal of restart strategy disabling, I thought about 
whether we should leave it in and get rid of the `SuppressRestartsException` 
instead. That seems to be cleaner than the custom exception, which is checked 
in the `ExecutionGraph`. If we want to get rid of it (as a follow up), we 
should leave the disable call in.
    
    - If you think that 
https://github.com/apache/flink/pull/2095#issuecomment-227731879 is a valid 
point, then we should apply it for the suspended state as well.
    
    These comments are not blockers and I think this is good to merge. Great 
job!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to