GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6155

    [FLINK-9494] Fix race condition in Dispatcher with granting and revoking 
leadership

    ## What is the purpose of the change
    
    The race condition was caused by the fact that the job recovery is executed 
outside
    of the main thread. Only after the recovery finishes, the Dispatcher will 
set the new
    fencing token and start the recovered jobs. The problem arose if in between 
these two
    operations the Dispatcher gets its leadership revoked. Then it could happen 
that the
    Dispatcher tries to run the recovered jobs even though it no longer holds 
the leadership.
    
    The race condition is solved by checking first whether we still hold the 
leadership which
    is identified by the given leader session id.
    
    This PR is based on #6154.
    
    cc @StefanRRichter 
    
    ## Brief change log
    
    - After recovering the jobs, check whether the leader session is still 
valid before assigning the fencing token and the starting the recovered jobs
    
    ## Verifying this change
    
    - Added `DispatcherHATest`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixDispatcherRaceCondition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6155
    
----
commit 54831c6a07dfcd5691fd732148a7c559514362ec
Author: Till Rohrmann <trohrmann@...>
Date:   2018-06-12T13:22:50Z

    [hotfix] Remove Scala dependency from ZooKeeperLeaderElectionTest

commit e50034088f4d85fe457e7015f162a6f86b1de9e7
Author: Till Rohrmann <trohrmann@...>
Date:   2018-06-12T12:24:59Z

    [FLINK-9573] Extend LeaderElectionService#hasLeadership to take leader 
session id
    
    The new LeaderElectionService#hasLeadership also takes the leader session 
id and verifies whether
    this is the correct leader session id associated with the leadership.

commit 104b46bd7848cd431afc564f6d3bb364a5257cf9
Author: Till Rohrmann <trohrmann@...>
Date:   2018-06-12T12:40:13Z

    [hotfix] Fix checkstyle violations in SingleLeaderElectionService

commit b5f9ee50dc208b767d56bc74a1705b853b10aa7c
Author: Till Rohrmann <trohrmann@...>
Date:   2018-06-12T12:26:15Z

    [hotfix] Make TestingDispatcher and TestingJobManagerRunnerFactory 
standalone testing classes
    
    Refactors the DispatcherTests and moves the TestingDispatcher and the 
TestingJobManagerRunnerFactory
    to be top level classes. This makes it easier to reuse them.

commit 190de76b21c3710d0fcbe5d66018c5853707cc33
Author: Till Rohrmann <trohrmann@...>
Date:   2018-06-12T12:27:30Z

    [FLINK-9494] Fix race condition in Dispatcher with granting and revoking 
leadership
    
    The race condition was caused by the fact that the job recovery is executed 
outside
    of the main thread. Only after the recovery finishes, the Dispatcher will 
set the new
    fencing token and start the recovered jobs. The problem arose if in between 
these two
    operations the Dispatcher gets its leadership revoked. Then it could happen 
that the
    Dispatcher tries to run the recovered jobs even though it no longer holds 
the leadership.
    
    The race condition is solved by checking first whether we still hold the 
leadership which
    is identified by the given leader session id.

----


---

Reply via email to