GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/4364
[FLINK-7216] [distr. coordination] Guard against concurrent global failover
**This is one of the blocker issues for the 1.3.2 release.**
## What is the purpose of the change
This fixes the bug
[FLINK-7216](https://issues.apache.org/jira/browse/FLINK-7216) where some race
conditions can trigger concurrent failovers, resulting in a restart storm.
The heart of the bug is the fact that we allow initiating another restart
while already being in state `RESTARTING`. That was introduced as a safety net
to catch exceptions (implementation bugs) that are reported in that state and
need a full recovery to ensure consistency.
However, this means that multiple restarts may accidentally be
triggered/queued and then execute one after another. While one attempt is
executing the failover, the next one interferes with it, or aborts (having
detected the conflict) and schedules another recovery, leading to the
above-mentioned restart storm.
The restart storm subsides once one restart attempt makes enough progress
(before the other interferes) to actually finish the scheduling phase.
## Brief change log
This contains three issues, because the first two were needed to
prepare the fix.
- [FLINK-6665](https://issues.apache.org/jira/browse/FLINK-6665) and
[FLINK-6667](https://issues.apache.org/jira/browse/FLINK-6667) introduce an
indirection where the `RestartStrategy` no longer calls `restart()` on the
`ExecutionGraph` directly. Instead, it invokes a callback to initiate the
restart.
- The actual fix makes sure that the `globalModVersion` (which tracks
global changes such as full restarts in the ExecutionGraph) is unchanged
between triggering the restart and executing it. When scheduling multiple
restart requests, only one will actually take effect, while the others detect
being subsumed.
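
The version guard described above can be illustrated with a minimal, self-contained sketch. It is not the code from this pull request: `GuardedRestartSketch`, `Graph`, `RestartCallback`, `failGlobal()` and `restartsExecuted()` are hypothetical names chosen for illustration, and the sketch assumes that each global failure increments the version and hands out a callback bound to the version it observed. A real implementation would presumably also check the job state and run the callback on an executor.

```java
public class GuardedRestartSketch {

    /** Hypothetical callback handed to a restart strategy instead of the graph itself. */
    interface RestartCallback {
        void triggerFullRecovery();
    }

    /** Simplified stand-in for an execution graph; not the real Flink class. */
    static class Graph {
        // Tracks global changes such as full restarts (assumed semantics).
        private long globalModVersion = 0;
        private int restartsExecuted = 0;

        /** A global failure bumps the version and returns a callback bound to that version. */
        synchronized RestartCallback failGlobal() {
            final long expectedVersion = ++globalModVersion;
            return () -> restart(expectedVersion);
        }

        /** Executes the restart only if no other global change happened in between. */
        synchronized boolean restart(long expectedVersion) {
            if (expectedVersion != globalModVersion) {
                // Another global failover was triggered meanwhile; this request is subsumed.
                return false;
            }
            restartsExecuted++;
            return true;
        }

        synchronized int restartsExecuted() {
            return restartsExecuted;
        }
    }

    public static void main(String[] args) {
        Graph graph = new Graph();

        // Two failures are reported before either restart gets to run.
        RestartCallback first = graph.failGlobal();
        RestartCallback second = graph.failGlobal();

        first.triggerFullRecovery();   // stale version, ignored
        second.triggerFullRecovery();  // matches the current version, executes

        System.out.println(graph.restartsExecuted()); // prints 1
    }
}
```

In this sketch the stale request simply returns without scheduling anything further, which is the property that prevents queued restarts from retriggering each other.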
## Verifying this change
This change added the following tests:
- `ExecutionGraphRestartTest#testConcurrentGlobalFailAndRestarts()`
explicitly tests that scenario
- `ExecutionGraphRestartTest#testConcurrentLocalFailAndRestart()` tests a
similar setup
The general working of that mechanism is also covered by various existing
tests in `org.apache.flink.runtime.executiongraph.restart`.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: **no**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **no**
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes**:
The change affects the restart logic on the `JobManager`.
## Documentation
- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? **not applicable**
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink concurrent_restarts_13
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4364.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4364
----
commit 1abb816d664bdac9d8b9af438769b9f685e768ce
Author: zjureel <[email protected]>
Date: 2017-07-18T17:27:56Z
[FLINK-6665] [FLINK-6667] [distributed coordination] Use a callback and a
ScheduledExecutor for ExecutionGraph restarts
Initial work by [email protected], improved by [email protected].
commit ef88524c808766e08d990f3bb69c45b04807c7c2
Author: Stephan Ewen <[email protected]>
Date: 2017-07-18T17:49:56Z
[FLINK-7216] [distr. coordination] Guard against concurrent global failover
----