GitHub user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171928797
  
    We could set the default execution retry delay to 0, assuming that any 
longer delay in a streaming use case would render the job's results wrong 
anyway. However, if a longer delay is acceptable, then a default of 0 would 
cost us the ability to recover from a lost task manager, which usually takes 
some time to reconnect to the cluster (given that we use all instances).
    
    I would be more in favour of having an exponential backoff strategy as the 
default. This would give us a quick recovery when enough resources are 
available, but also the possibility to wait for a TM to reconnect. We could 
implement such a restart strategy once PR #1470 is merged.
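
    To make the idea concrete, here is a minimal sketch of how the delay could 
be computed. It is only an illustration: the class name ExponentialBackoff, its 
parameters and the example values are assumptions of mine, not part of Flink's 
API or of PR #1470.

```java
/**
 * Hypothetical helper (not a Flink class): computes the delay before the
 * n-th restart attempt, growing exponentially up to a fixed cap.
 */
final class ExponentialBackoff {
    private final long initialDelayMs; // delay before the first restart attempt
    private final long maxDelayMs;     // upper bound, so we never wait forever
    private final double multiplier;   // growth factor between attempts

    ExponentialBackoff(long initialDelayMs, long maxDelayMs, double multiplier) {
        if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs || multiplier < 1.0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.multiplier = multiplier;
    }

    /** Returns the delay in milliseconds for a 0-based restart attempt. */
    long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Math.pow may overflow to infinity for large attempts; the cap handles that.
        double delay = initialDelayMs * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxDelayMs);
    }
}
```

    With a small initial delay the first restarts happen almost immediately, 
which covers the case where enough resources are available, while the later, 
longer delays leave time for a TM to reconnect. The cap keeps the wait bounded.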
    
    What do you think?

