Re: [PR] [FLINK-33739][doc] Document FLIP-364: Improve the exponential-delay restart-strategy [flink]

via GitHub Mon, 05 Feb 2024 18:34:37 -0800


1996fanrui commented on code in PR #24263:
URL: https://github.com/apache/flink/pull/24263#discussion_r1479152005



##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
 
 The cluster defined restart strategy is used. 
 This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:
+Jobs can be retried quickly when exceptions occur occasionally, and avalanches of external components can be avoided when exceptions occur frequently. The reasons are as follows:

Review Comment:
   I updated the zh doc as well.



##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -296,7 +336,25 @@ env = StreamExecutionEnvironment.get_execution_environment(config)
 
 The cluster defined restart strategy is used. 
 This is helpful for streaming programs which enable checkpointing.
-By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
+By default, the exponential delay restart strategy is chosen if there is no other restart strategy defined.
+
+### Default restart strategy
+
+When Checkpoint is enabled and the user does not specify a restart strategy, [`Exponential delay restart strategy`]({{< ref "docs/ops/state/task_failure_recovery" >}}#exponential-delay-restart-strategy)
+is the current default restart strategy. We strongly recommend Flink users to use the exponential delay restart strategy and set it as the default restart strategy because compared to other restart strategies, it can:

Review Comment:
   That makes sense. I have removed it in the Chinese doc as well.



##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
 {{< /tab >}}
 {{< /tabs >}}
 
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1 s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job encounters consecutive exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 seconds and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time).
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will always be 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that a random value is added to or subtracted from each delay time, and the ratio of the random value to the delay time is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs from restarting at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).

Review Comment:
   Change it to `encounters exceptions`
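
For anyone checking the numbers in the example above, the retry schedule can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described in the doc text, not Flink's implementation: the function name `backoff_delay` and its parameter names are hypothetical, and jitter is assumed to be turned off, as in the first YAML snippet.

```python
def backoff_delay(attempt, initial_backoff=1.0, multiplier=2.0, max_backoff=10.0):
    """Delay in seconds before the given consecutive retry (1-based), without jitter.

    Hypothetical helper: it mirrors the doc example, not Flink's internal code.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Multiply the initial backoff once per earlier retry, then cap at max_backoff.
    return min(initial_backoff * multiplier ** (attempt - 1), max_backoff)


schedule = [backoff_delay(n) for n in range(1, 7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

The fifth value would be 16 seconds after doubling, so the cap of 10 seconds applies from that retry onward, which matches the bullet list in the hunk.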



##########
docs/content/docs/ops/state/task_failure_recovery.md:
##########
@@ -204,6 +203,47 @@ Still not supported in Python API.
 {{< /tab >}}
 {{< /tabs >}}
 
+#### Example
+
+Here is an example to explain how the exponential delay restart strategy works.
+
+```yaml
+restart-strategy.exponential-delay.initial-backoff: 1 s
+restart-strategy.exponential-delay.backoff-multiplier: 2
+restart-strategy.exponential-delay.max-backoff: 10 s
+# For convenience of description, jitter is turned off here
+restart-strategy.exponential-delay.jitter-factor: 0
+```
+
+- `initial-backoff = 1 s` means that when an exception occurs for the first time, the job will be delayed for 1 second before retrying.
+- `backoff-multiplier = 2` means that when the job encounters consecutive exceptions, the delay time is doubled each time.
+- `max-backoff = 10 s` means the retry delay is at most 10 seconds.
+
+Based on these parameters:
+
+- When an exception occurs and the job needs to be retried for the first time, the job will be delayed for 1 second and then retried.
+- When an exception occurs and the job needs to be retried for the second time, the job will be delayed for 2 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the third time, the job will be delayed for 4 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fourth time, the job will be delayed for 8 seconds and then retried.
+- When an exception occurs and the job needs to be retried for the fifth time, the job will be delayed for 10 seconds and then retried
+  (it will exceed the upper limit after doubling, so the upper limit of 10 seconds is used as the delay time).
+- On the fifth retry, the delay time has reached the upper limit (max-backoff), so after the fifth retry, the delay time will always be 10 seconds.
+  After each failure, it will be delayed for 10 seconds and then retried.
+
+```yaml
+restart-strategy.exponential-delay.jitter-factor: 0.1
+restart-strategy.exponential-delay.attempts-before-reset-backoff: 8
+restart-strategy.exponential-delay.reset-backoff-threshold: 6 min
+```
+
+- `jitter-factor = 0.1` means that a random value is added to or subtracted from each delay time, and the ratio of the random value to the delay time is within 0.1. For example:
+  - In the third retry, the job delay time is between 3.6 seconds and 4.4 seconds (3.6 = 4 * 0.9, 4.4 = 4 * 1.1).
+  - In the fourth retry, the job delay time is between 7.2 seconds and 8.8 seconds (7.2 = 8 * 0.9, 8.8 = 8 * 1.1).
+  - Random values can prevent multiple jobs from restarting at the same time, so it is not recommended to set jitter-factor to 0 in the production environment.
+- `attempts-before-reset-backoff = 8` means that if the job still has exception after 8 consecutive retries, it will fail (no more retries).
+- `reset-backoff-threshold = 6 min` means that when the job has lasted for 6 minutes without an exception, the delay time and retry count will be reset.

Review Comment:
   `has lasted` -> `runs`
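
The jitter ranges quoted in the hunk (3.6 to 4.4 seconds, 7.2 to 8.8 seconds) can be checked with a small Python sketch. The helper names `jitter_bounds` and `jittered_delay` are hypothetical, and drawing the value uniformly from the range is an assumption made here for illustration; the doc text only states the range, not the distribution.

```python
import random


def jitter_bounds(delay, jitter_factor=0.1):
    """Return the (lowest, highest) delay once jitter is applied to a base delay."""
    return delay * (1 - jitter_factor), delay * (1 + jitter_factor)


def jittered_delay(delay, jitter_factor=0.1, rng=random):
    """Pick one delay inside the jitter range (uniform choice is an assumption)."""
    low, high = jitter_bounds(delay, jitter_factor)
    return rng.uniform(low, high)


third_low, third_high = jitter_bounds(4.0)    # base delay of the third retry
fourth_low, fourth_high = jitter_bounds(8.0)  # base delay of the fourth retry
print(round(third_low, 2), round(third_high, 2))
print(round(fourth_low, 2), round(fourth_high, 2))
```

With `jitter_factor=0`, both bounds collapse to the base delay, which is the case the doc advises against in production because many jobs would then restart at the same moment.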


