[GitHub] [flink] rmetzger commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

GitBox Tue, 06 Oct 2020 04:29:17 -0700


rmetzger commented on a change in pull request #13540:
URL: https://github.com/apache/flink/pull/13540#discussion_r500200594




##########
File path: 
flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws 
Exception {
                final TestingJobManagerRunner testingJobManagerRunner = 
jobManagerRunnerFactory.takeCreatedJobManagerRunner();
                testingJobManagerRunner.completeResultFutureExceptionally(new 
JobNotFinishedException(jobId));
 
+               // wait until termination JobManagerRunner closeAsync has been 
called.
+               // this is necessary to avoid race conditions with completion 
of the 1st job and the submission of the 2nd job 
(DuplicateJobSubmissionException).
+               testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       Thanks a lot for your review.
   
   
   The `DuplicateJobSubmissionException` exception is thrown, because the same 
job is already in the  `runningJobs` list. 
   
   I believe there is a race condition between the main thread executing the 
test and the dispatcher main thread: 
   (dispatcher thread) As soon as the `TestingJobManagerRunner` has been 
created, it is offered to a queue in the `TestingJobManagerRunnerFactory`. The 
(main test thread) is waiting on the queue. The next step for this thread will 
be updating the DispatcherJob's status, and setting up a forward of the result 
future from the jobManagerRunner.
   (main test thread) the `TestingJobManagerRunner` is available now, so the 
thread can continue. The thread uses the `TestingJobManagerRunner` to complete 
the result future. The result future will be forwarded in `DispatcherJob`. 
However, if (dispatcher thread) is not fast enough in registering the forward, 
the main thread might manage to submit the 2nd job before (dispatcher thread) 
forwards the completion.
   
   I can reproduce the problem locally by introducing a sleep for 10 
milliseconds after offering the `TestingJobManagerRunner` in the 
`TestingJobManagerRunnerFactory`. This will guarantee that the result forward 
setup is always happening too late.
   
   In my opinion this test instability is purely an issue of the test, not the 
Dispatcher implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] rmetzger commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Reply via email to