GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/6155
[FLINK-9494] Fix race condition in Dispatcher with granting and revoking
leadership
## What is the purpose of the change
The race condition was caused by the fact that the job recovery is executed
outside
of the main thread. Only after the recovery finishes, the Dispatcher will
set the new
fencing token and start the recovered jobs. The problem arose if in between
these two
operations the Dispatcher gets its leadership revoked. Then it could happen
that the
Dispatcher tries to run the recovered jobs even though it no longer holds
the leadership.
The race condition is solved by checking first whether we still hold the
leadership which
is identified by the given leader session id.
This PR is based on #6154.
cc @StefanRRichter
## Brief change log
- After recovering the jobs, check whether the leader session is still
valid before assigning the fencing token and the starting the recovered jobs
## Verifying this change
- Added `DispatcherHATest`
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink fixDispatcherRaceCondition
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/6155.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6155
----
commit 54831c6a07dfcd5691fd732148a7c559514362ec
Author: Till Rohrmann <trohrmann@...>
Date: 2018-06-12T13:22:50Z
[hotfix] Remove Scala dependency from ZooKeeperLeaderElectionTest
commit e50034088f4d85fe457e7015f162a6f86b1de9e7
Author: Till Rohrmann <trohrmann@...>
Date: 2018-06-12T12:24:59Z
[FLINK-9573] Extend LeaderElectionService#hasLeadership to take leader
session id
The new LeaderElectionService#hasLeadership also takes the leader session
id and verifies whether
this is the correct leader session id associated with the leadership.
commit 104b46bd7848cd431afc564f6d3bb364a5257cf9
Author: Till Rohrmann <trohrmann@...>
Date: 2018-06-12T12:40:13Z
[hotfix] Fix checkstyle violations in SingleLeaderElectionService
commit b5f9ee50dc208b767d56bc74a1705b853b10aa7c
Author: Till Rohrmann <trohrmann@...>
Date: 2018-06-12T12:26:15Z
[hotfix] Make TestingDispatcher and TestingJobManagerRunnerFactory
standalone testing classes
Refactors the DispatcherTests and moves the TestingDispatcher and the
TestingJobManagerRunnerFactory
to be top level classes. This makes it easier to reuse them.
commit 190de76b21c3710d0fcbe5d66018c5853707cc33
Author: Till Rohrmann <trohrmann@...>
Date: 2018-06-12T12:27:30Z
[FLINK-9494] Fix race condition in Dispatcher with granting and revoking
leadership
The race condition was caused by the fact that the job recovery is executed
outside
of the main thread. Only after the recovery finishes, the Dispatcher will
set the new
fencing token and start the recovered jobs. The problem arose if in between
these two
operations the Dispatcher gets its leadership revoked. Then it could happen
that the
Dispatcher tries to run the recovered jobs even though it no longer holds
the leadership.
The race condition is solved by checking first whether we still hold the
leadership which
is identified by the given leader session id.
----
---