pnowojski commented on a change in pull request #15728:
URL: https://github.com/apache/flink/pull/15728#discussion_r621187613



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/DefaultCheckpointPlanCalculator.java
##########
@@ -111,7 +112,10 @@ public void setAllowCheckpointsAfterTasksFinished(boolean allowCheckpointsAfterT
                                         ? calculateAfterTasksFinished()
                                         : calculateWithAllTasksRunning();
 
-                        checkTasksStarted(result.getTasksToTrigger());
+                        checkTasksStarted(
+                                isUnalignedCheckpoint
+                                        ? result.getTasksToWaitFor()
+                                        : result.getTasksToTrigger());

Review comment:
   > So all in all, only high backpressure cases would really suffer from
   > that change and we want to encourage UC for them.
   
   > In my opinion, only the first point makes sense. It is better to have the
   > first checkpoint sooner rather than later. But I still don't understand why
   > it is important, because for UC, which we want to have as the primary one,
   > we don't support such behaviour. So my position is to have the same
   > behaviour for both checkpoint types. If we think that the delay in starting
   > the first checkpoint is crucial, then we should support it for UC (maybe
   > not in this ticket but in general). But if we think that it is not so
   > important, then we can remove this support from AC.
   
   I think those are good arguments.
   
   However, there is one more case that worries me. What if recovery itself is
   what causes this huge backpressure and, apart from that, the user is happy
   with AC? For example: recovery takes 1h, the checkpointing interval is
   10 minutes, and recovery creates 1h worth of backpressure (every minute that
   the sources are running while some downstream task is not processing records
   creates a one-minute backlog of records). Currently the first checkpoint
   would complete after 1h 10 minutes. If we wait for recovery to complete
   before triggering the first checkpoint, it would complete after 2h.
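
   To make the arithmetic explicit, here is a rough sketch of the model I have
   in mind. It is not Flink code; `FirstCheckpointEstimate` and
   `completionMinutes` are made-up names. It assumes that the barrier queues
   behind whatever backlog built up before it was injected, that nothing is
   drained until recovery finishes, and that draining one minute of backlog
   takes one minute.

```java
public class FirstCheckpointEstimate {

    /**
     * Estimated completion time (in minutes since job start) of the first
     * aligned checkpoint, under the simplifying assumptions stated above.
     *
     * @param triggerMinutes  when the checkpoint barrier is injected at the sources
     * @param recoveryMinutes how long the slow downstream task takes to recover
     */
    static long completionMinutes(long triggerMinutes, long recoveryMinutes) {
        // Backlog ahead of the barrier: the sources produce records while the
        // downstream task is not processing, until the barrier is injected.
        long backlogAhead = Math.min(triggerMinutes, recoveryMinutes);
        // The barrier can only start moving once it exists and recovery is done.
        long drainStart = Math.max(triggerMinutes, recoveryMinutes);
        // Assumption: one minute of backlog takes one minute to process.
        return drainStart + backlogAhead;
    }

    public static void main(String[] args) {
        long recovery = 60; // recovery takes 1h
        long interval = 10; // checkpointing interval of 10 minutes

        // Current behaviour: trigger after the first interval, during recovery.
        System.out.println(completionMinutes(interval, recovery));
        // Proposed behaviour: trigger only once recovery has completed.
        System.out.println(completionMinutes(recovery, recovery));
    }
}
```

   With the numbers from the example, this gives 70 minutes for the current
   behaviour and 120 minutes if triggering waits for recovery.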



