Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/1954#discussion_r67505611
--- Diff: docs/apis/streaming/fault_tolerance.md ---
@@ -338,6 +342,77 @@ The default value is the value of *akka.ask.timeout*.
{% top %}
+### Failure Rate Restart Strategy
+
+The failure rate restart strategy restarts job after failure, but when `failure rate` (failures per time unit) is exceeded, the job eventually fails.
+In-between two consecutive restart attempts, the restart strategy waits a fixed amount of time.
+
+This strategy is enabled as default by setting the following configuration parameter in `flink-conf.yaml`.
+
+~~~
+restart-strategy: failure-rate
+~~~
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 40%">Configuration Parameter</th>
+ <th class="text-left" style="width: 40%">Description</th>
+ <th class="text-left">Default Value</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><it>restart-strategy.failure-rate.max-failures-per-unit</it></td>
+ <td>Maximum number of restarts in given time unit before failing a job</td>
+ <td>1</td>
+ </tr>
+ <tr>
+ <td><it>restart-strategy.failure-rate.failure-rate-unit</it></td>
+ <td>Time unit for measuring failure rate. One of java.util.concurrent.TimeUnit values</td>
+ <td>MINUTES</td>
+ </tr>
+ <tr>
+ <td><it>restart-strategy.failure-rate.delay</it></td>
+ <td>Delay between two consecutive restart attempts</td>
+ <td><it>akka.ask.timeout</it></td>
+ </tr>
+ </tbody>
+</table>
+
+~~~
+restart-strategy.failure-rate.max-failures-per-unit: 3
+restart-strategy.failure-rate.failure-rate-unit: MINUTES
+restart-strategy.failure-rate.delay: 10 s
+~~~
+
+The failure rate restart strategy can also be set programmatically:
+
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+env.setRestartStrategy(RestartStrategies.failureRateRestart(
+ 3, // max failures per unit
+ java.util.concurrent.TimeUnit.MINUTES, //time unit for measuring failure rate
+ 10000 // delay in milliseconds
--- End diff --
Maybe we should support a more flexible delay specification instead of a raw
millisecond count, something like `"10 seconds"` or `TimeUnit.seconds(10)`.
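
To make the suggestion concrete, here is a minimal sketch of how a string such as `"10 seconds"` could be turned into the millisecond value the current signature expects. The `DelaySpec` class, its `parseMillis` method, and the accepted unit labels are all hypothetical and are not part of Flink or of this pull request. Treating a bare number as milliseconds is also only an assumption, chosen to match the `10000` in the diff.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing Flink class.
final class DelaySpec {

    // A non-negative integer, optional whitespace, then an optional unit label.
    private static final Pattern FORMAT = Pattern.compile("(\\d+)\\s*([a-z]*)");

    private DelaySpec() {}

    /** Parses a delay such as "10 seconds", "10 s" or "500 ms" into milliseconds. */
    static long parseMillis(String text) {
        Matcher m = FORMAT.matcher(text.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot parse delay: " + text);
        }
        long value = Long.parseLong(m.group(1));
        return unitFor(m.group(2)).toMillis(value);
    }

    // Maps a unit label to a TimeUnit; the set of labels is an assumption.
    private static TimeUnit unitFor(String label) {
        switch (label) {
            case "": // assumption: a bare number means milliseconds
            case "ms":
            case "milliseconds":
                return TimeUnit.MILLISECONDS;
            case "s":
            case "sec":
            case "second":
            case "seconds":
                return TimeUnit.SECONDS;
            case "min":
            case "minute":
            case "minutes":
                return TimeUnit.MINUTES;
            case "h":
            case "hour":
            case "hours":
                return TimeUnit.HOURS;
            default:
                throw new IllegalArgumentException("Unknown time unit: " + label);
        }
    }

    public static void main(String[] args) {
        System.out.println(DelaySpec.parseMillis("10 seconds"));
    }
}
```

With something along these lines, the `10 s` value from the `flink-conf.yaml` example and the programmatic delay could share one format, so the caller would not need to know that the argument is in milliseconds.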