GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/4933
[FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices

## What is the purpose of the change

One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices call: when the job reaches FAILING, not all executions are necessarily in state CANCELLING yet.

The other race condition is between completeCancellingForAllVertices and the fixed delay restart strategy configured without a delay. Suppose the 10th task is the one that failed. To restart, we then only have to complete the cancellation of the first 9 tasks, which is enough for the restart strategy to restart the job. If the restart happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, we may end up cancelling a fresh execution.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented?
(not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink hardenExecutionGraphRestartTest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4933

----
commit 76659cce74afb3045164c522a91b1b1688e34f38
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2017-11-01T15:23:51Z

    [hotfix] Make WaitForTasks using an AtomicInteger

commit b1701e31305b05488a6ff6b0c305193a13a68637
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2017-11-01T15:53:14Z

    [FLINK-7352] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices

    One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the
    subsequent completeCancellingForAllVertices where not all execution are in state CANCELLING.

    The other race condition is between completeCancellingForAllVertices and the fixed delay
    restart without a delay. The problem is that the 10th task could have failed. In order to
    restart we would have to complete the cancel for the first 9 tasks. This is enough for the
    restart strategy to restart the job. If this happens before completeCancellingForAllVertices
    has also cancelled the execution of the 10th task, it could happen that we cancel a fresh
    execution.

----

---
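
The second race described in the commit message can be illustrated with a small, deterministic model. This is only a sketch and not the change made in this pull request: `Attempt`, `State`, and `completeCancelingFor` are hypothetical stand-ins, not Flink classes. It assumes that nine attempts are still in the canceling state while the tenth task has already been restarted with a fresh attempt, and shows one possible guard: completing cancellation only through a compare-and-set from the canceling state, so a fresh attempt is left untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CancelRaceSketch {

    // Hypothetical, simplified task states (not Flink's ExecutionState).
    enum State { RUNNING, CANCELING, CANCELED }

    // Hypothetical stand-in for a single execution attempt of a task.
    static final class Attempt {
        final int attemptNumber;
        final AtomicReference<State> state;

        Attempt(int attemptNumber, State initial) {
            this.attemptNumber = attemptNumber;
            this.state = new AtomicReference<>(initial);
        }

        // Completes cancellation only if this attempt is currently CANCELING.
        // A fresh attempt (RUNNING) is not affected and false is returned.
        boolean completeCanceling() {
            return state.compareAndSet(State.CANCELING, State.CANCELED);
        }
    }

    // Tries to complete cancellation for every attempt and returns how many
    // attempts actually moved from CANCELING to CANCELED.
    static int completeCancelingFor(List<Attempt> attempts) {
        int completed = 0;
        for (Attempt attempt : attempts) {
            if (attempt.completeCanceling()) {
                completed++;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        // Nine tasks are still waiting for their cancellation to complete.
        for (int i = 0; i < 9; i++) {
            attempts.add(new Attempt(0, State.CANCELING));
        }
        // Assumed scenario: the tenth task was already restarted, so its
        // current attempt is fresh and running.
        attempts.add(new Attempt(1, State.RUNNING));

        int completed = completeCancelingFor(attempts);
        // prints "9 RUNNING"
        System.out.println(completed + " " + attempts.get(9).state.get());
    }
}
```

An unconditional `state.set(State.CANCELED)` in place of the compare-and-set would also cancel the fresh tenth attempt, which is the failure mode the commit message describes.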