[GitHub] [flink] rmetzger commented on a change in pull request #16229: [FLINK-22464][tests] Fix OperatorCoordinator test which is stalling with AdaptiveScheduler

GitBox Tue, 22 Jun 2021 01:18:48 -0700


rmetzger commented on a change in pull request #16229:
URL: https://github.com/apache/flink/pull/16229#discussion_r655966742




##########
File path: 
flink-tests/src/test/java/org/apache/flink/runtime/operators/coordination/OperatorEventSendingCheckpointITCase.java
##########
@@ -219,6 +223,30 @@ public Long map(Long value) throws Exception {
                                         if (++num > failAt) {
                                             throw new Exception("Artificial 
intermittent failure.");
                                         }
+                                        // This test depends on checkpoints 
persisting progress from
+                                        // the source before above exception 
gets triggered.
+                                        // Otherwise, the job will run for a 
long time (or forever)
+                                        // because above exception will be 
thrown before any
+                                        // checkpoint successfully completes.
+                                        //
+                                        // Checkpoints are triggered once the 
checkpoint scheduler
+                                        // gets started + a random initial 
delay.
+                                        // For DefaultScheduler, this 
mechanism is fine, because DS
+                                        // starts the checkpoint coordinator, 
then requests the
+                                        // required slots and then deploys the 
tasks. These
+                                        // operations take enough time to have 
a checkpoint
+                                        // triggered by the time the task 
starts running.
+                                        // AdaptiveScheduler starts the 
CheckpointCoordinator right
+                                        // before deploying tasks (when slots 
are available
+                                        // already), hence tasks will start 
running almost
+                                        // immediately, and the checkpoint 
gets triggered too late
+                                        // (it won't be able to complete 
before the artificial
+                                        // failure from this test)
+                                        // When we detect AdaptiveScheduler, 
we artificially slow
+                                        // down
+                                        if (adaptiveSchedulerEnabled) {
+                                            Thread.sleep(1);

Review comment:
       Thanks for taking a look at the change! After an insane amount of time 
trying to figure out this problem, I was so excited that I forgot the golden 
rule for tests #1: don't use `Thread.sleep` 🤕 . I'll look into alternatives.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] rmetzger commented on a change in pull request #16229: [FLINK-22464][tests] Fix OperatorCoordinator test which is stalling with AdaptiveScheduler

Reply via email to