Myracle commented on code in PR #27484:
URL: https://github.com/apache/flink/pull/27484#discussion_r2867223019


##########
flink-core/src/main/java/org/apache/flink/configuration/CheckpointingOptions.java:
##########
@@ -527,6 +527,28 @@ public class CheckpointingOptions {
                                                     CHECKPOINTING_INTERVAL_DURING_BACKLOG.key()))
                                     .build());
 
+    /**
+     * The initial delay before the first checkpoint is triggered after the job starts.
+     *
+     * <p>This is useful for jobs that need time to warm up or catch up with backlogs before
+     * performing the first checkpoint.
+     */
+    @PublicEvolving
+    public static final ConfigOption<Duration> CHECKPOINTING_INITIAL_DELAY =
+            ConfigOptions.key("execution.checkpointing.initial-delay")
+                    .durationType()
+                    .defaultValue(Duration.ZERO)
+                    .withDescription(
+                            Description.builder()
+                                    .text(
+                                            "The initial delay before the first checkpoint is triggered after the job starts. "
+                                                    + "This is useful for jobs that need time to warm up or catch up with backlogs. "
+                                                    + "If set to 0 (default), the initial delay will be randomly chosen between "
Review Comment:
   Thanks for raising this concern!
   I'd like to clarify the behavior for the explicitly configured initial-delay 
case (i.e., initialCheckpointDelay > 0):
   
   In this branch, the jitter added on top of the user-configured delay is 
already bounded by Math.min(baseInterval, 60_000L), which means the maximum 
additional jitter is capped at 60 seconds, regardless of how large 
CHECKPOINTING_INTERVAL is. So even with a very large checkpoint interval (e.g., 
1 hour), a user who sets initial-delay = 30s would see an actual delay in the 
range of [30s, 90s] — not anywhere close to 1 hour.
   
   Using minPause as the jitter bound instead could be problematic in some 
cases: when minPause is 0 (which is the default), the jitter would also be 0, 
effectively eliminating the thundering-herd protection that the randomization 
is designed to provide.
   
   That said, your concern is very valid for the default branch 
(initialCheckpointDelay == 0), where the original random range was [minPause, 
baseInterval] — that could indeed produce a very long delay when the checkpoint 
interval is large. I've already addressed this in a separate change by 
switching the default to minPause + random(0, minPause), which keeps the total 
delay within [minPause, 2 * minPause].
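   As a sketch of that revised default branch (again with hypothetical names, assuming uniformly distributed jitter):
   
   ```java
   import java.util.concurrent.ThreadLocalRandom;
   
   // Hypothetical sketch of the fixed default branch (initialCheckpointDelay == 0):
   // delay = minPause + random(0, minPause), keeping the total in
   // [minPause, 2 * minPause] instead of the old [minPause, baseInterval] range.
   public class DefaultDelaySketch {
       static long firstCheckpointDelay(long minPauseMs) {
           long jitter = minPauseMs == 0
                   ? 0L
                   : ThreadLocalRandom.current().nextLong(minPauseMs + 1);
           return minPauseMs + jitter;
       }
   }
   ```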
   
   To summarize:
   - Configured initial-delay > 0: jitter is already capped at 60s — no risk of 
excessively long delays.
   - Default initial-delay = 0: already fixed to use minPause + 
jitter(minPause) per your suggestion.
   
   Let me know if you think the 60-second cap for the configured case should 
also be tightened further!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
