GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/4933
[FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices
## What is the purpose of the change
This change fixes two race conditions in the test.
The first race is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices call: when the job reaches FAILING, not all executions are necessarily in state CANCELLING yet.
The second race is between completeCancellingForAllVertices and the fixed delay restart strategy when it is configured without a delay. Suppose the 10th task is the one that failed. To restart, we then only have to complete the cancellation of the first 9 tasks, which is enough for the restart strategy to restart the job. If the restart happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, the test can end up cancelling a fresh execution.
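
As an illustration of the first race only, here is a minimal, self-contained sketch of polling until every execution reports CANCELLING before completing the cancellation. It is not the code in this pull request: the class `WaitForAllCancelling`, the enum `State`, and the helper `waitUntilAllInState` are hypothetical stand-ins that do not use any Flink classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class WaitForAllCancelling {

    // Hypothetical stand-in for an execution state; not Flink's ExecutionState.
    enum State { RUNNING, CANCELLING, CANCELED }

    /**
     * Polls until every supplier reports the target state.
     * Throws TimeoutException if that does not happen within timeoutMillis.
     */
    static void waitUntilAllInState(
            List<Supplier<State>> executions, State target, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean allMatch = true;
            for (Supplier<State> execution : executions) {
                if (execution.get() != target) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Not all executions reached " + target);
            }
            Thread.sleep(10); // back off briefly before polling again
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<State> first = new AtomicReference<>(State.CANCELLING);
        AtomicReference<State> second = new AtomicReference<>(State.RUNNING);

        // Simulate one task that switches to CANCELLING a little later than the other.
        Thread lateTransition = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            second.set(State.CANCELLING);
        });
        lateTransition.start();

        List<Supplier<State>> executions = new ArrayList<>();
        executions.add(first::get);
        executions.add(second::get);

        // Only proceed to "complete cancelling" once every execution is CANCELLING.
        waitUntilAllInState(executions, State.CANCELLING, 1000);
        lateTransition.join();
        System.out.println("all executions are CANCELLING");
    }
}
```

Polling on the per-execution state rather than on the job status removes the window in which the job is already FAILING while some executions have not yet switched to CANCELLING. The second race needs a different guard, which this sketch does not cover.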
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
## Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink hardenExecutionGraphRestartTest
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4933.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4933
----
commit 76659cce74afb3045164c522a91b1b1688e34f38
Author: Till Rohrmann <[email protected]>
Date: 2017-11-01T15:23:51Z
[hotfix] Make WaitForTasks using an AtomicInteger
commit b1701e31305b05488a6ff6b0c305193a13a68637
Author: Till Rohrmann <[email protected]>
Date: 2017-11-01T15:53:14Z
[FLINK-7352] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices
One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices where not all execution are in state CANCELLING.
The other race condition is between completeCancellingForAllVertices and the fixed delay restart without a delay. The problem is that the 10th task could have failed. In order to restart we would have to complete the cancel for the first 9 tasks. This is enough for the restart strategy to restart the job. If this happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, it could happen that we cancel a fresh execution.
----