AHeise commented on a change in pull request #12478:
URL: https://github.com/apache/flink/pull/12478#discussion_r437232512
##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
##########
@@ -56,7 +56,7 @@
public class ChannelStateWriterImpl implements ChannelStateWriter {
 private static final Logger LOG = LoggerFactory.getLogger(ChannelStateWriterImpl.class);
- private static final int DEFAULT_MAX_CHECKPOINTS = 5; // currently, only single in-flight checkpoint is supported
+ private static final int DEFAULT_MAX_CHECKPOINTS = 100; // includes max-concurrent-checkpoints + checkpoints to be aborted (scheduled via mailbox)
Review comment:
Just thinking out loud.
Previously, the issue was that sources received new RPC triggers while being
stuck, which enqueued a ton of mails.
With a checkpointing interval of 100ms, you only need to be stuck for 10s
before you enqueue 100 mails and hit the limit.
But I guess the assumption is that `notifyCheckpointAborted` is now called
reliably. And if it isn't, we probably need to fail anyway, since something is
broken.
So I guess this solution is as good as it gets. It's a trade-off between quickly
identifying bugs in abort and potentially running into issues with ultra-slow
tasks and ultra-fast checkpoint barriers.
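To make the trade-off concrete, this is roughly the kind of bound I'm picturing (a hypothetical sketch with made-up names, not the actual `ChannelStateWriterImpl` internals): fail fast once the backlog of started-but-not-yet-aborted checkpoints exceeds the limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch (made-up names): every started checkpoint holds a slot
// until it finishes or is aborted, so with a 100ms checkpoint interval a stuck
// task burns through 100 slots in roughly 10s.
class BoundedCheckpointRegistry {
    private final ConcurrentMap<Long, Object> inProgress = new ConcurrentHashMap<>();
    private final int maxCheckpoints;

    BoundedCheckpointRegistry(int maxCheckpoints) {
        this.maxCheckpoints = maxCheckpoints; // e.g. DEFAULT_MAX_CHECKPOINTS = 100
    }

    void start(long checkpointId) {
        if (inProgress.size() >= maxCheckpoints) {
            // failing fast here is what lets us spot a broken abort path quickly
            throw new IllegalStateException(
                "too many pending checkpoints: " + inProgress.size());
        }
        inProgress.put(checkpointId, new Object());
    }

    void finishOrAbort(long checkpointId) {
        inProgress.remove(checkpointId); // frees the slot again
    }
}
```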