zhuzhurk commented on a change in pull request #9113: [FLINK-13222] [runtime]
Add documentation for failover strategy option
URL: https://github.com/apache/flink/pull/9113#discussion_r307184994
##########
File path: docs/dev/task_failure_recovery.md
##########
@@ -264,4 +267,54 @@ The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
By default, a fixed delay restart strategy is chosen if there is no other
restart strategy defined.
+## Failover Strategies
+
+Flink supports different failover strategies which can be configured via the
configuration parameter
+*jobmanager.execution.failover-strategy* in Flink's configuration file
`flink-conf.yaml`.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 50%">Failover Strategy</th>
+ <th class="text-left">Value for
jobmanager.execution.failover-strategy</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Restart all</td>
+ <td>full</td>
+ </tr>
+ <tr>
+ <td>Restart pipelined region</td>
+ <td>region</td>
+ </tr>
+ </tbody>
+</table>
+
+### Restart All Failover Strategy
+
+This strategy restarts all tasks in the job to recover from a task failure.
+
+### Restart Pipelined Region Failover Strategy
+
+This strategy groups tasks into disjoint regions. When a task failure is
detected,
+this strategy computes the smallest set of regions that must be restarted to
recover from the failure.
+For some jobs this can result in fewer tasks that will be restarted compared
to the Restart All Failover Strategy.
+
+A region is a set of tasks that communicate via pipelined data exchanges.
+That is, batch data exchanges denote the boundaries of a region.
+- All data exchanges in a DataStream job or Streaming Table/SQL job are
pipelined.
+- All data exchanges in a Batch Table/SQL job are batched by default.
+- The data exchange types in a DataSet job are determined by the
+ [ExecutionMode]({{ site.javadocs_baseurl
}}/api/java/org/apache/flink/api/common/ExecutionMode.html)
+ which can be set through [ExecutionConfig]({{ site.baseurl
}}/dev/execution_configuration.html).
+
+The regions to restart are decided as below:
+1. The region containing the failed task will be restarted.
+2. If a result partition is not available while it is required by a region
that will be restarted,
+ the region producing the result partition will be restarted as well.
+3. If a region is to be restarted, all of its consumer regions will also be
restarted. This is to guarantee
+ data consistency because nondeterministic processing or partitioning can
result in that a partition
Review comment:
Sorry, I thought I have published the previous reply but it is not.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services