GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/4933

    [FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices

    ## What is the purpose of the change
    
    One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the
    subsequent completeCancellingForAllVertices: when the job reaches FAILING, not all
    executions are necessarily in state CANCELLING yet.
    
    The other race condition is between completeCancellingForAllVertices and the fixed
    delay restart strategy when it is configured without a delay. The problem is that the
    10th task could have failed. In that case, completing the cancellation of the first 9
    tasks is enough for the restart strategy to restart the job. If the restart happens
    before completeCancellingForAllVertices has also cancelled the execution of the 10th
    task, we could end up cancelling a fresh execution.
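
The two races described above can be illustrated with a minimal, self-contained sketch. It does not use Flink classes and is not necessarily how this pull request fixes the test: `CancelRaceSketch`, `Attempt`, `State`, `waitUntilAllCancellingOrFailed` and `completeCancellingFor` are hypothetical names introduced only for illustration. The idea, under those assumptions, is to wait until every execution is either CANCELLING or FAILED before completing the cancellation, and to complete it only on a snapshot of the attempts, using a compare-and-set so that a fresh attempt is never cancelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CancelRaceSketch {

    /** Simplified, hypothetical task states; not Flink's ExecutionState. */
    enum State { RUNNING, CANCELLING, CANCELED, FAILED }

    /** Hypothetical stand-in for a single execution attempt of one task. */
    static final class Attempt {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

        /** Completes cancellation only if this attempt is currently CANCELLING. */
        boolean completeCancelling() {
            return state.compareAndSet(State.CANCELLING, State.CANCELED);
        }
    }

    /**
     * Polls until every attempt is CANCELLING or FAILED.
     * Returns false if the timeout expires or the thread is interrupted.
     */
    static boolean waitUntilAllCancellingOrFailed(List<Attempt> attempts, long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean ready = true;
            for (Attempt attempt : attempts) {
                State s = attempt.state.get();
                if (s != State.CANCELLING && s != State.FAILED) {
                    ready = false;
                    break;
                }
            }
            if (ready) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Completes cancellation for the given snapshot; returns how many attempts were completed. */
    static int completeCancellingFor(List<Attempt> snapshot) {
        int completed = 0;
        for (Attempt attempt : snapshot) {
            if (attempt.completeCancelling()) {
                completed++;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            attempts.add(new Attempt());
        }

        // The 10th task fails; the other nine are asked to cancel asynchronously.
        attempts.get(9).state.set(State.FAILED);
        Thread canceller = new Thread(() -> {
            for (int i = 0; i < 9; i++) {
                attempts.get(i).state.set(State.CANCELLING);
            }
        });
        canceller.start();

        // Snapshot the attempts first so a later restart cannot swap in fresh ones.
        List<Attempt> snapshot = new ArrayList<>(attempts);
        if (waitUntilAllCancellingOrFailed(snapshot, 1000)) {
            System.out.println("completed: " + completeCancellingFor(snapshot));
        }
    }
}
```

The compare-and-set in `completeCancelling` is the design choice that matters here: an attempt that is still RUNNING (for example, a fresh one created by a restart) is left untouched rather than being cancelled by mistake.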
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink hardenExecutionGraphRestartTest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4933
    
----
commit 76659cce74afb3045164c522a91b1b1688e34f38
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2017-11-01T15:23:51Z

    [hotfix] Make WaitForTasks use an AtomicInteger

commit b1701e31305b05488a6ff6b0c305193a13a68637
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2017-11-01T15:53:14Z

    [FLINK-7352] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices
    
    One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the
    subsequent completeCancellingForAllVertices: when the job reaches FAILING, not all
    executions are necessarily in state CANCELLING yet.
    
    The other race condition is between completeCancellingForAllVertices and the fixed
    delay restart strategy when it is configured without a delay. The problem is that the
    10th task could have failed. In that case, completing the cancellation of the first 9
    tasks is enough for the restart strategy to restart the job. If the restart happens
    before completeCancellingForAllVertices has also cancelled the execution of the 10th
    task, we could end up cancelling a fresh execution.

----

