zhuzhurk commented on code in PR #24263:
URL: https://github.com/apache/flink/pull/24263#discussion_r1478238605
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
Review Comment:
then retried -> then retry
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
Review Comment:
"and set it as the default restart strategy " looks a bit redundant to me
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time)..
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will be always 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that each delay time will be added or subtracted by a random value, and the ration range of the random value is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs restart at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).
+- `reset-backoff-threshold = 6 min` means that when the job has lasted for 6 minutes without an exception, the delay time and retry count will be reset.
Review Comment:
lasted -> run
retry count -> retry counter
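The delay sequence walked through in this hunk can be reproduced with a short standalone calculation. This is only an illustrative sketch of the arithmetic, delay = min(initial-backoff * backoff-multiplier^(n-1), max-backoff), not Flink's internal implementation:

```python
def backoff_before_retry(attempt: int,
                         initial_backoff: float = 1.0,
                         backoff_multiplier: float = 2.0,
                         max_backoff: float = 10.0) -> float:
    """Delay in seconds before the given retry attempt (1-indexed)."""
    return min(initial_backoff * backoff_multiplier ** (attempt - 1), max_backoff)

print([backoff_before_retry(n) for n in range(1, 8)])
# [1.0, 2.0, 4.0, 8.0, 10.0, 10.0, 10.0] -- capped at max-backoff from the 5th retry on
```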
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time)..
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will be always 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that each delay time will be added or subtracted by a random value, and the ration range of the random value is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs restart at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).
Review Comment:
has exception -> encounter exceptions
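The jitter ranges quoted in this hunk (3.6 s to 4.4 s, 7.2 s to 8.8 s) follow from scaling the un-jittered delay by a factor between 1 - jitter-factor and 1 + jitter-factor. Here is a standalone sketch of that calculation, assuming the random offset is uniformly distributed (the exact distribution is an assumption, not taken from Flink's code):

```python
import random

def jittered_backoff(backoff: float, jitter_factor: float = 0.1) -> float:
    """One sample of the jittered delay; stays within backoff * (1 +/- jitter_factor)."""
    return backoff * (1 + random.uniform(-jitter_factor, jitter_factor))

# Third retry: the un-jittered backoff is 4 s, so every sample falls in [3.6 s, 4.4 s].
samples = [jittered_backoff(4.0) for _ in range(5)]
assert all(3.6 <= s <= 4.4 for s in samples)
print(samples)
```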
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
Review Comment:
> because compared to other restart strategies, it can: Jobs can be retried quickly ...
Maybe "because by using this strategy, jobs can be retried quickly ..."
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
Review Comment:
service -> services
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
+    For example: a large number of Flink jobs are consuming Kafka. When the Kafka cluster crashes, a large number of Flink jobs are frequently retried at the same time, which is likely to cause an avalanche.
+  - If the delay time is set too long, when the exception occurs occasionally, it will have to wait a long time before retrying, resulting in reduced job availability.
Review Comment:
the exception -> exceptions
it -> jobs
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
The cluster defined restart strategy is used.
This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:
+
+- All restart strategies will delay some time when restarting the job to avoid frequent retries that put greater pressure on external components.
+- The delay time for all restart strategies except the exponential delay restart strategy is fixed.
+  - If the delay time is set too short, when exceptions occur frequently in a short period of time, the master node of external service will be accessed frequently, which may cause an avalanche of the external service.
+    For example: a large number of Flink jobs are consuming Kafka. When the Kafka cluster crashes, a large number of Flink jobs are frequently retried at the same time, which is likely to cause an avalanche.
+  - If the delay time is set too long, when the exception occurs occasionally, it will have to wait a long time before retrying, resulting in reduced job availability.
+- The delay time of each retry of the exponential delay restart strategy will increase exponentially until the maximum delay time is reached.
+  - The initial value of the delay time is shorter, so when the exception occurs occasionally, it can be retried quickly to improve job availability.
+  - When exceptions occur frequently in a short period of time, the exponential delay restart strategy will reduce the frequency of retries to avoid an avalanche of external service.
+- In addition, the delay time of the exponential delay restart strategy supports the jitter-factor configuration option.
+  - The jitter factor adds or subtracts a random value to each delay time.
+  - Even if multiple jobs use an exponential delay restart strategy and the value of all configuration options are exactly the same, the jitter factor will let these jobs to restart at different times.
Review Comment:
to restart -> restart
##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
{{< /tab >}}
{{< /tabs >}}
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job has continuous exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 second and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 second and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 second and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 second and then retried
Review Comment:
Maybe using 1st, 2nd, 3rd, 4th and 5th would make it easier for reading.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]