rmetzger commented on a change in pull request #16229:
URL: https://github.com/apache/flink/pull/16229#discussion_r655966742
##########
File path:
flink-tests/src/test/java/org/apache/flink/runtime/operators/coordination/OperatorEventSendingCheckpointITCase.java
##########
@@ -219,6 +223,30 @@ public Long map(Long value) throws Exception {
if (++num > failAt) {
throw new Exception("Artificial
intermittent failure.");
}
+ // This test depends on checkpoints
persisting progress from
+ // the source before above exception
gets triggered.
+ // Otherwise, the job will run for a
long time (or forever)
+ // because above exception will be
thrown before any
+ // checkpoint successfully completes.
+ //
+ // Checkpoints are triggered once the
checkpoint scheduler
+ // gets started + a random initial
delay.
+ // For DefaultScheduler, this
mechanism is fine, because DS
+ // starts the checkpoint coordinator,
then requests the
+ // required slots and then deploys the
tasks. These
+ // operations take enough time to have
a checkpoint
+ // triggered by the time the task
starts running.
+ // AdaptiveScheduler starts the
CheckpointCoordinator right
+ // before deploying tasks (when slots
are available
+ // already), hence tasks will start
running almost
+ // immediately, and the checkpoint
gets triggered too late
+ // (it won't be able to complete
before the artificial
+ // failure from this test)
+ // When we detect AdaptiveScheduler,
we artificially slow
+ // down
+ if (adaptiveSchedulerEnabled) {
+ Thread.sleep(1);
Review comment:
Thanks for taking a look at the change! After an insane amount of time
trying to figure out this problem, I was so excited that I forgot the golden
rule for tests #1: don't use `Thread.sleep` 🤕 . I'll look into alternatives.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]