This is an automated email from the ASF dual-hosted git repository.
fanrui pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
The following commit(s) were added to refs/heads/master by this push:
new 2f040e38eab [FLINK-33739][doc] Document FLIP-364: Improve the exponential-delay restart-strategy
2f040e38eab is described below
commit 2f040e38eabacb24bba0907590ea4a4c3f91b9ec
Author: Rui Fan <[email protected]>
AuthorDate: Sun Feb 4 17:48:38 2024 +0800
[FLINK-33739][doc] Document FLIP-364: Improve the exponential-delay restart-strategy
---
.../docs/ops/state/task_failure_recovery.md | 68 ++++++++++++++++++--
.../docs/ops/state/task_failure_recovery.md | 72 +++++++++++++++++++---
2 files changed, 128 insertions(+), 12 deletions(-)
diff --git a/docs/content.zh/docs/ops/state/task_failure_recovery.md b/docs/content.zh/docs/ops/state/task_failure_recovery.md
index b7e9dba546c..a2ed511ce97 100644
--- a/docs/content.zh/docs/ops/state/task_failure_recovery.md
+++ b/docs/content.zh/docs/ops/state/task_failure_recovery.md
@@ -39,8 +39,9 @@ Flink 作业如果没有定义重启策略,则会遵循集群启动时加载
如果提交作业时设置了重启策略,该策略将覆盖掉集群的默认策略。
通过 [Flink 配置文件]({{< ref "docs/deployment/config#flink-配置文件" >}})
来设置默认的重启策略。配置参数 *restart-strategy.type* 定义了采取何种策略。
-如果没有启用 checkpoint,就采用“不重启”策略。如果启用了 checkpoint 且没有配置重启策略,那么就采用固定延时重启策略,
-此时最大尝试重启次数由 `Integer.MAX_VALUE` 参数设置。下表列出了可用的重启策略和与其对应的配置值。
+如果没有启用 checkpoint,就采用 `不重启` 策略。如果启用了 checkpoint 且没有配置重启策略,默认采用
+`exponential-delay` (指数延迟) 重启策略,且会使用 `exponential-delay` 相关配置项的默认值。
+下表列出了可用的重启策略和与其对应的配置值。
每个重启策略都有自己的一组配置参数来控制其行为。
这些参数也在配置文件中设置。
@@ -142,8 +143,7 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
### Exponential Delay Restart Strategy
-指数延迟重启策略无限地重启作业,作业永远不失败。
-在两次连续的重新启动尝试之间,重新启动的延迟时间不断呈指数增长,直到达到最大延迟时间。
+指数延迟重启策略在两次连续的重新启动尝试之间,重新启动的延迟时间不断呈指数增长,直到达到最大延迟时间。
然后,延迟时间将保持在最大延迟时间。
当作业正确地执行后,指数延迟时间会在一些时间后被重置为初始值,这些阈值可以被配置。
@@ -199,6 +199,46 @@ Python API 不支持。
{{< /tab >}}
{{< /tabs >}}
+#### 示例
+
+以下是一个示例,用于解释指数延迟重启策略的工作原理。
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# 为了方便描述,这里关闭了 jitter
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` 表示当作业第一次发生异常时会延迟 1 秒后进行重试。
+- `backoff-multiplier = 2` 表示当作业连续异常时,每次的延迟时间翻倍。
+- `max-backoff = 10 s` 表示重试的延迟时间最多为 10 秒。
+
+基于这些参数:
+
+- 当作业发生异常需要进行第 1 次重试时,作业会延迟 1 秒后重试。
+- 当作业发生异常需要进行第 2 次重试时,作业会延迟 2 秒后重试(翻倍)。
+- 当作业发生异常需要进行第 3 次重试时,作业会延迟 4 秒后重试(翻倍)。
+- 当作业发生异常需要进行第 4 次重试时,作业会延迟 8 秒后重试(翻倍)。
+- 当作业发生异常需要进行第 5 次重试时,作业会延迟 10 秒后重试(翻倍后超过上限,所以使用上限 10 秒作为延迟时间)。
+- 在第 5 次重试时,延迟时间已经达到了 max-backoff(上限),所以第 5 次重试以后,作业延迟时间会保持在 10 秒不变,每次失败后都会延迟 10 秒后重试。
+
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` 表示每次的延迟时间会加减一个随机值,随机值的范围在 0.1 的比例内。
+  - 例如第 3 次重试时,作业延迟时间在 3.6 秒到 4.4 秒之间(3.6 = 4 * 0.9, 4.4 = 4 * 1.1)。
+  - 例如第 4 次重试时,作业延迟时间在 7.2 秒到 8.8 秒之间(7.2 = 8 * 0.9, 8.8 = 8 * 1.1)。
+ - 随机值可以避免多个作业在同一时间重启,所以在生产环境不建议将 jitter-factor 设置为 0。
+- `attempts-before-reset-backoff = 8` 表示如果作业连续重试了 8 次后仍然有异常,则会失败(不再重试)。
+- `reset-backoff-threshold = 6 min` 表示当作业已经持续 6 分钟没发生异常时,则会重置延迟时间和重试计数。
+ 也就是当作业发生异常时,如果上一次异常发生在 6 分钟之前,则重试的延迟时间重置为 1 秒,当前的重试计数重置为 1。
+
### Failure Rate Restart Strategy
@@ -294,7 +334,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
使用群集定义的重启策略。
这对于启用了 checkpoint 的流处理程序很有帮助。
-如果没有定义其他重启策略,默认选择固定延时重启策略。
+如果没有定义其他重启策略,默认选择指数延迟重启策略。
+
+### 默认重启策略
+
+当 Checkpoint 开启且用户没有指定重启策略时,[`指数延迟重启策略`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+是当前默认的重启策略。我们强烈推荐 Flink 用户使用指数延迟重启策略,因为使用这个策略时,
+作业偶尔异常可以快速重试,作业频繁异常可以避免外部组件发生雪崩。原因如下所示:
+
+- 所有的重启策略在重启作业时都会延迟一定的时间,来避免频繁重试对外部组件产生较大压力。
+- 除了指数延迟重启策略以外的所有重启策略延迟时间都是固定的。
+  - 如果延迟时间设置得过短,当作业短时间内频繁异常时,会频繁重启访问外部组件的主节点,可能导致外部组件发生雪崩。
+    例如:大量的 Flink 作业都在消费 Kafka,当 Kafka 集群出现故障时大量的 Flink 作业都在同一时间频繁重试,很可能导致雪崩。
+  - 如果延迟时间设置得过长,当作业偶尔失败时需要等待很久才会重试,从而导致作业可用率降低。
+- 指数延迟重启策略每次重试的延迟时间会指数递增,直到达到最大延迟时间。
+ - 延迟时间的初始值较短,所以当作业偶尔失败时,可以快速重试,提升作业可用率。
+ - 当作业短时间内频繁失败时,指数延迟重启策略会降低重试的频率,从而避免外部组件雪崩。
+- 除此以外,指数延迟重启策略的延迟时间支持抖动因子 (jitter-factor) 的配置项。
+ - 抖动因子会为每次的延迟时间加减一个随机值。
+ - 即使多个作业使用指数延迟重启策略且所有的配置参数完全相同,抖动因子也会让这些作业分散在不同的时间重启。
## Failover Strategies
diff --git a/docs/content/docs/ops/state/task_failure_recovery.md b/docs/content/docs/ops/state/task_failure_recovery.md
index 094667fea29..a091d79e35b 100644
--- a/docs/content/docs/ops/state/task_failure_recovery.md
+++ b/docs/content/docs/ops/state/task_failure_recovery.md
@@ -41,9 +41,10 @@ In case that the job is submitted with a restart strategy, this strategy overrid
The default restart strategy is set via [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}).
The configuration parameter *restart-strategy.type* defines which strategy is taken.
-If checkpointing is not enabled, the "no restart" strategy is used.
-If checkpointing is activated and the restart strategy has not been configured, the fixed-delay strategy is used with
-`Integer.MAX_VALUE` restart attempts.
+If checkpointing is not enabled, the `no restart` strategy is used.
+If checkpointing is activated and the restart strategy has not been configured,
+the `exponential-delay` restart strategy is used with the default values of its related config options.
See the following list of available restart strategies to learn what values are supported.
Each restart strategy comes with its own set of parameters which control its behaviour.
@@ -146,9 +147,7 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
### Exponential Delay Restart Strategy
-The exponential delay restart strategy attempts to restart the job infinitely, with increasing delay up to the maximum delay.
-The job never fails.
-In-between two consecutive restart attempts, the restart strategy keeps exponentially increasing until the maximum number is reached.
+In-between two consecutive restart attempts, the exponential delay restart strategy keeps increasing the delay exponentially until the maximum delay is reached.
Then, it keeps the delay at the maximum delay.
When the job executes correctly, the exponential delay value resets after some time; this threshold is configurable.
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the 1st time, the job will be delayed for 1 second and then retry.
+- When an exception occurs and the job needs to be retried for the 2nd time, the job will be delayed for 2 seconds and then retry.
+- When an exception occurs and the job needs to be retried for the 3rd time, the job will be delayed for 4 seconds and then retry.
+- When an exception occurs and the job needs to be retried for the 4th time, the job will be delayed for 8 seconds and then retry.
+- When an exception occurs and the job needs to be retried for the 5th time, the job will be delayed for 10 seconds and then retry
+  (doubling would exceed the upper limit, so the upper limit of 10 seconds is used as the delay time).
+- On the 5th retry, the delay time has reached the upper limit (max-backoff), so from the 5th retry onward the delay time will always be 10 seconds.
+  After each failure, the job will be delayed for 10 seconds and then retry.
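The schedule above can be sketched as follows (a minimal illustration of the doubling-with-cap rule described here, not Flink's actual implementation; the helper name is hypothetical):

```python
def backoff_delay(attempt: int,
                  initial: float = 1.0,
                  multiplier: float = 2.0,
                  maximum: float = 10.0) -> float:
    """Delay in seconds before the given retry attempt (1-based), jitter disabled."""
    # Double the delay for each consecutive failure, capped at max-backoff.
    return min(initial * multiplier ** (attempt - 1), maximum)

# The first six retries are delayed 1, 2, 4, 8, 10 and 10 seconds.
print([backoff_delay(n) for n in range(1, 7)])
```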
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that a random value is added to or subtracted from each delay time, and the ratio of the random value is within 0.1. For example:
+  - In the 3rd retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the 4th retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values prevent multiple jobs from restarting at the same time, so it is not recommended to set jitter-factor to 0 in production environments.
+- `attempts-before-reset-backoff = 8` means that if the job still encounters exceptions after 8 consecutive retries, it will fail (no more retries).
+- `reset-backoff-threshold = 6 min` means that when the job runs for 6 minutes without an exception, the delay time and retry counter will be reset.
+  That is, when an exception occurs in a job, if the last exception occurred more than 6 minutes ago, the retry delay time is reset to 1 second and the current retry counter is reset to 1.
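The jitter bounds in the bullets above can be sketched as follows (a minimal illustration; `jittered_bounds` is a hypothetical helper, not a Flink API):

```python
def jittered_bounds(delay: float, jitter_factor: float = 0.1) -> tuple[float, float]:
    """Range the actual delay falls into once the jitter factor is applied."""
    # The random offset is within +/- jitter_factor of the base delay.
    return delay * (1 - jitter_factor), delay * (1 + jitter_factor)

# 3rd retry (base delay 4 s): the actual delay lies between 3.6 s and 4.4 s.
low, high = jittered_bounds(4.0)
```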
+
### Failure Rate Restart Strategy
The failure rate restart strategy restarts the job after failure, but when the `failure rate` (failures per time interval) is exceeded, the job eventually fails.
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When checkpointing is enabled and the user does not specify a restart strategy, the [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend that Flink users use the exponential delay restart strategy, because with this strategy
+jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies delay for some time when restarting the job, to avoid frequent retries putting great pressure on external components.
+- For all restart strategies except the exponential delay restart strategy, the delay time is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of an external service will be accessed frequently, which may cause an avalanche of the external service.
+    For example: a large number of Flink jobs consume Kafka. When the Kafka cluster crashes, a large number of Flink jobs retry frequently at the same time, which is likely to cause an avalanche.
+  - If the delay time is set too long, when exceptions occur occasionally, jobs have to wait a long time before retrying, resulting in reduced job availability.
+- With the exponential delay restart strategy, the delay time of each retry increases exponentially until the maximum delay time is reached.
+  - The initial delay time is short, so when exceptions occur occasionally, jobs can be retried quickly, improving job availability.
+  - When exceptions occur frequently in a short period of time, the exponential delay restart strategy reduces the frequency of retries to avoid an avalanche of external services.
+- In addition, the delay time of the exponential delay restart strategy supports the jitter-factor configuration option.
+  - The jitter factor adds or subtracts a random value to each delay time.
+  - Even if multiple jobs use the exponential delay restart strategy and the values of all configuration options are exactly the same, the jitter factor will let these jobs restart at different times.
## Failover Strategies