1996fanrui commented on code in PR #24263:
URL: https://github.com/apache/flink/pull/24263#discussion_r1479152005
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env =
StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other
restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no
other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy,
[`Exponential delay restart strategy`]({{< ref
"docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to
use the exponential delay restart strategy and set it as the default restart
strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches
of external components can be avoided when exceptions occur frequently. The
reasons are as follows:
Review Comment:
I updated zh doc as well.
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env =
StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other
restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no
other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy,
[`Exponential delay restart strategy`]({{< ref
"docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to
use the exponential delay restart strategy and set it as the default restart
strategy because compared to other restart strategies, it can:
Review Comment:
Makes sense. I have removed it in the Chinese doc as well.
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first
time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions,
the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time,
the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second
time, the job will be delayed for 2 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the third time,
the job will be delayed for 4 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fourth
time, the job will be delayed for 8 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time,
the job will be delayed for 10 seconds and then retried
+ (it will exceed the upper limit after doubling, so the upper limit of 10
seconds is used as the delay time).
+- On the fifth retry, the delay time has reached the upper limit
(max-backoff), so after the fifth retry, the delay time will always be 10
seconds.
+ After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that a random value is added to or subtracted
from each delay time, and the ratio of the random value to the delay is at most
0.1. For example:
+ - In the third retry, the job delay time is between 3.6 seconds and 4.4
seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+ - In the fourth retry, the job delay time is between 7.2 seconds and 8.8
seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+ - Random values can prevent multiple jobs from restarting at the same time, so
it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has
exception after 8 consecutive retries, it will fail (no more retries).
Review Comment:
Change it to `encounters exceptions`
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first
time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions,
the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time,
the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second
time, the job will be delayed for 2 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the third time,
the job will be delayed for 4 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fourth
time, the job will be delayed for 8 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time,
the job will be delayed for 10 seconds and then retried
+ (it will exceed the upper limit after doubling, so the upper limit of 10
seconds is used as the delay time).
+- On the fifth retry, the delay time has reached the upper limit
(max-backoff), so after the fifth retry, the delay time will always be 10
seconds.
+ After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that a random value is added to or subtracted
from each delay time, and the ratio of the random value to the delay is at most
0.1. For example:
+ - In the third retry, the job delay time is between 3.6 seconds and 4.4
seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+ - In the fourth retry, the job delay time is between 7.2 seconds and 8.8
seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+ - Random values can prevent multiple jobs from restarting at the same time, so
it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has
exception after 8 consecutive retries, it will fail (no more retries).
+- `reset-backoff-threshold = 6 min` means that when the job has lasted for 6
minutes without an exception, the delay time and retry count will be reset.
Review Comment:
`has lasted` -> `runs`
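
For reference, the delay sequence described in the example can be checked with a small Python sketch. This is only an illustration of the arithmetic in the doc text, not Flink's implementation: the function names `backoff_delays`, `jitter_bounds`, and `jittered_delay` are hypothetical, the jitter is assumed to be drawn uniformly, and `attempts-before-reset-backoff` and `reset-backoff-threshold` are not modeled.

```python
import random


def backoff_delays(initial, multiplier, max_backoff, retries):
    """Jitter-free delay in seconds before each of the first `retries` retries."""
    delays = []
    current = initial
    for _ in range(retries):
        # The delay never exceeds max_backoff.
        delays.append(min(current, max_backoff))
        current = min(current * multiplier, max_backoff)
    return delays


def jitter_bounds(delay, jitter_factor):
    """Smallest and largest delay once jitter is applied."""
    return delay * (1 - jitter_factor), delay * (1 + jitter_factor)


def jittered_delay(delay, jitter_factor, rng=random):
    """Pick a delay inside the jitter bounds (uniform draw is an assumption)."""
    low, high = jitter_bounds(delay, jitter_factor)
    return rng.uniform(low, high)


# initial-backoff = 1 s, backoff-multiplier = 2, max-backoff = 10 s
print(backoff_delays(1.0, 2.0, 10.0, 6))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

With `jitter-factor = 0.1`, `jitter_bounds(4.0, 0.1)` gives roughly 3.6 to 4.4 seconds for the third retry, matching the ranges quoted in the doc.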