GitHub user StephanEwen commented on a diff in the pull request:
https://github.com/apache/flink/pull/1223#discussion_r41239329
--- Diff: docs/apis/programming_guide.md ---
@@ -1992,6 +1992,8 @@ With the closure cleaner disabled, it might happen
that an anonymous user functi
- `getNumberOfExecutionRetries()` / `setNumberOfExecutionRetries(int
numberOfExecutionRetries)` Sets the number of times that failed tasks are
re-executed. A value of zero effectively disables fault tolerance. A value of
`-1` indicates that the system default value (as defined in the configuration)
should be used.
+- `getExecutionRetryDelay()` / `setExecutionRetryDelay(long
executionRetryDelay)` Sets the delay that failed tasks are re-executed. A value
of `-1` indicates that the default value should be used.
--- End diff --
I think this is a critical parameter, so I would like to extend the
description a bit. How about this:
```
Sets the delay that the system waits after a job has failed, before
re-executing it. The delay starts after all tasks have been successfully
stopped on the TaskManagers, and once the delay has passed, the tasks are
restarted. This parameter is useful to delay re-execution in order to let
certain time-out related failures surface fully (like broken connections that
have not fully timed out), before attempting a re-execution and immediately
failing again due to the same problem.
This parameter only has an effect if the number of execution retries is
one or more.
```
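To make the proposed wording concrete, here is a minimal sketch of the semantics it describes: wait for the delay after a failure, then re-execute, and never wait when no retries are configured. It is only a model, not Flink's implementation. The class `RetryWithDelay` and its `run` method are hypothetical names made up for illustration; in Flink itself the two values would be set through the `setNumberOfExecutionRetries` and `setExecutionRetryDelay` setters from the diff.

```java
import java.util.concurrent.Callable;

// Hypothetical model of the retry semantics described above.
// This is NOT Flink code; names and structure are assumptions.
final class RetryWithDelay {

    /**
     * Runs the job and re-executes it after a failure, at most maxRetries
     * times, sleeping retryDelayMillis before each re-execution.
     * Returns the total number of attempts that were made.
     */
    static int run(Callable<Boolean> job, int maxRetries, long retryDelayMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            boolean succeeded;
            try {
                succeeded = job.call();
            } catch (Exception e) {
                // Any exception counts as a failed attempt.
                succeeded = false;
            }
            if (succeeded || attempts > maxRetries) {
                return attempts;
            }
            // Only reached when another attempt will follow, so the delay
            // has no effect when maxRetries is zero.
            try {
                Thread.sleep(retryDelayMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return attempts;
            }
        }
    }
}
```

With `maxRetries` set to zero the method returns after the first attempt without sleeping, which mirrors the last sentence of the proposed description.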