zhuzhurk commented on code in PR #24263:
URL: https://github.com/apache/flink/pull/24263#discussion_r1478238605
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
Review Comment:
then retried -> then retry
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
Review Comment:
"and set it as the default restart strategy " looks a bit redundant to me
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time)..
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will be always 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that each delay time will be added or subtracted by a random value, and the ration range of the random value is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs restart at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).
+- `reset-backoff-threshold = 6 min` means that when the job has lasted for 6 minutes without an exception, the delay time and retry count will be reset.
Review Comment:
lasted -> run
retry count -> retry counter
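The delay sequence walked through in this hunk can be reproduced with a short standalone calculation. This is only an illustrative sketch of the arithmetic, delay = min(initial-backoff * backoff-multiplier^(n-1), max-backoff), not Flink's internal implementation:

```python
def backoff_before_retry(attempt: int,
                         initial_backoff: float = 1.0,
                         backoff_multiplier: float = 2.0,
                         max_backoff: float = 10.0) -> float:
    """Delay in seconds before the given retry attempt (1-indexed)."""
    return min(initial_backoff * backoff_multiplier ** (attempt - 1), max_backoff)

print([backoff_before_retry(n) for n in range(1, 8)])
# [1.0, 2.0, 4.0, 8.0, 10.0, 10.0, 10.0] -- capped at max-backoff from the 5th retry on
```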
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time)..
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will be always 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that each delay time will be added or subtracted by a random value, and the ration range of the random value is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs restart at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).
Review Comment:
has exception -> encounter exceptions
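The jitter ranges quoted in this hunk (3.6 s to 4.4 s, 7.2 s to 8.8 s) follow from scaling the un-jittered delay by a factor between 1 - jitter-factor and 1 + jitter-factor. Here is a standalone sketch of that calculation, assuming the random offset is uniformly distributed (the exact distribution is an assumption, not taken from Flink's code):

```python
import random

def jittered_backoff(backoff: float, jitter_factor: float = 0.1) -> float:
    """One sample of the jittered delay; stays within backoff * (1 +/- jitter_factor)."""
    return backoff * (1 + random.uniform(-jitter_factor, jitter_factor))

# Third retry: the un-jittered backoff is 4 s, so every sample falls in [3.6 s, 4.4 s].
samples = [jittered_backoff(4.0) for _ in range(5)]
assert all(3.6 <= s <= 4.4 for s in samples)
print(samples)
```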
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
Review Comment:
> because compared to other restart strategies, it can: Jobs can be retried quickly ...
Maybe "because by using this strategy, jobs can be retried quickly ..."
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
Review Comment:
service -> services
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
+    For example: a large number of Flink jobs are consuming Kafka. When the Kafka cluster crashes, a large number of Flink jobs are frequently retried at the same time, which is likely to cause an avalanche.
+  - If the delay time is set too long, when the exception occurs occasionally, it will have to wait a long time before retrying, resulting in reduced job availability.
Review Comment:
the exception -> exceptions
it -> jobs
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
+    For example: a large number of Flink jobs are consuming Kafka. When the Kafka cluster crashes, a large number of Flink jobs are frequently retried at the same time, which is likely to cause an avalanche.
+  - If the delay time is set too long, when the exception occurs occasionally, it will have to wait a long time before retrying, resulting in reduced job availability.
+- The delay time of each retry of the exponential delay restart strategy will increase exponentially until the maximum delay time is reached.
+  - The initial value of the delay time is shorter, so when the exception occurs occasionally, it can be retried quickly to improve job availability.
+  - When exceptions occur frequently in a short period of time, the exponential delay restart strategy will reduce the frequency of retries to avoid an avalanche of external service.
+- In addition, the delay time of the exponential delay restart strategy supports the jitter-factor configuration option.
+  - The jitter factor adds or subtracts a random value to each delay time.
+  - Even if multiple jobs use an exponential delay restart strategy and the value of all configuration options are exactly the same, the jitter factor will let these jobs to restart at different times.
Review Comment:
to restart -> restart
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
Review Comment:
Maybe using 1st, 2nd, 3rd, 4th and 5th would make it easier for reading.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]