AHeise commented on a change in pull request #12478:
URL: https://github.com/apache/flink/pull/12478#discussion_r437232512



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
##########
@@ -56,7 +56,7 @@
 public class ChannelStateWriterImpl implements ChannelStateWriter {
 
        private static final Logger LOG = LoggerFactory.getLogger(ChannelStateWriterImpl.class);
-       private static final int DEFAULT_MAX_CHECKPOINTS = 5; // currently, only single in-flight checkpoint is supported
+       private static final int DEFAULT_MAX_CHECKPOINTS = 100; // includes max-concurrent-checkpoints + checkpoints to be aborted (scheduled via mailbox)

Review comment:
       Just thinking out loud.
   
   Previously, the issue was that sources received new RPC triggers while being stuck, which enqueued a ton of mails.
   With a checkpointing interval of 100 ms, a task only needs to be stuck for 10 s before it enqueues 100 mails and hits the limit.
   
   But I guess the assumption is that `notifyCheckpointAborted` is now called reliably. And if it isn't, we probably need to fail anyway, since something is broken.
   
   So I guess this solution is as good as it gets. It trades off quickly identifying bugs in abort handling against potentially running into issues with ultra-slow tasks and ultra-fast checkpoint barriers.
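   
   To make the back-of-envelope concrete, here is a minimal, hypothetical sketch (the `pendingResults` map and the `start`/`finish` methods are illustrative assumptions, not the actual `ChannelStateWriterImpl` internals) of how a fixed max-checkpoints bound interacts with a 100 ms trigger interval when a stuck task never frees a slot:

```java
// Hypothetical sketch, not the actual ChannelStateWriterImpl code: a writer that
// tracks pending checkpoints and rejects new ones once a fixed bound is reached.
import java.util.HashMap;
import java.util.Map;

public class MaxCheckpointsSketch {

	private static final int DEFAULT_MAX_CHECKPOINTS = 100;
	private static final long CHECKPOINT_INTERVAL_MS = 100;

	// Assumed bookkeeping: one entry per checkpoint that was started but not yet
	// completed or aborted.
	private final Map<Long, Object> pendingResults = new HashMap<>();

	void start(long checkpointId) {
		if (pendingResults.size() >= DEFAULT_MAX_CHECKPOINTS) {
			throw new IllegalStateException(
				"too many pending checkpoints: " + pendingResults.size());
		}
		pendingResults.put(checkpointId, new Object());
	}

	// Completion or abort notification is what frees a slot again.
	void finish(long checkpointId) {
		pendingResults.remove(checkpointId);
	}

	public static void main(String[] args) {
		// Back-of-envelope from the comment: with a 100 ms checkpoint interval,
		// a stuck task (one that never reaches finish) exhausts the bound of 100
		// after 100 * 100 ms = 10 s of continuously triggered checkpoints.
		MaxCheckpointsSketch writer = new MaxCheckpointsSketch();
		long stuckMillis = DEFAULT_MAX_CHECKPOINTS * CHECKPOINT_INTERVAL_MS;
		System.out.println("Bound exhausted after ~" + stuckMillis + " ms");
		try {
			for (long id = 1; id <= DEFAULT_MAX_CHECKPOINTS + 1; id++) {
				writer.start(id); // the 101st start exceeds the bound
			}
		} catch (IllegalStateException e) {
			System.out.println("Rejected: " + e.getMessage());
		}
	}
}
```

   Running this rejects the 101st `start`, matching the 100 × 100 ms = 10 s figure above; in the real writer, reliable abort notifications are what keep the map from ever filling up.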




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

