[GitHub] flink pull request #4364: [FLINK-7216] [distr. coordination] Guard against c...

StephanEwen Wed, 19 Jul 2017 01:34:37 -0700

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/4364


    [FLINK-7216] [distr. coordination] Guard against concurrent global failover

    **This is one of the blocker issues for the 1.3.2 release.**
    
    ## What is the purpose of the change
    
    This fixes the bug 
[FLINK-7216](https://issues.apache.org/jira/browse/FLINK-7216), where race 
conditions can trigger concurrent failovers and cause a restart storm.
    
    The heart of the bug is the fact that we allow initiating another restart 
while already being in state `RESTARTING`. That was introduced as a safety net 
to catch exceptions (implementation bugs) that are reported in that state and 
need a full recovery to ensure consistency.
    
    However, this means that multiple restarts may accidentally be 
triggered/queued and then execute one after another. While one attempt is 
executing the failover, the next one will interfere or abort (when it is 
detected as conflicting) and schedule another recovery, leading to the 
above-mentioned restart storm. The restart storm subsides once one restart 
attempt makes enough progress (before another one interferes) to actually 
finish the scheduling phase.
    
    ## Brief change log
    
    This contains three issues, because the first two were needed for 
preparing the fix.
      - [FLINK-6665](https://issues.apache.org/jira/browse/FLINK-6665) and 
[FLINK-6667](https://issues.apache.org/jira/browse/FLINK-6667) introduce an 
indirection where the `RestartStrategy` no longer calls `restart()` on the 
`ExecutionGraph` directly. Instead, it invokes a callback to initiate the 
restart.
      - The actual fix makes sure that the `globalModVersion` (which tracks 
global changes such as full restarts in the `ExecutionGraph`) is unchanged 
between triggering the restart and executing it. When multiple restart 
requests are scheduled, only one will actually take effect, while the others 
detect that they have been subsumed.
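
    The version guard described above can be sketched roughly as follows. 
This is a minimal, self-contained illustration and not the code from this pull 
request: the class `VersionGuardedRestartSketch`, the `RestartCallback` 
interface, and all method names are hypothetical, and the sketch assumes the 
version is bumped at the moment a restart is executed, which may differ from 
where the `ExecutionGraph` actually increments `globalModVersion`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the ExecutionGraph; not Flink's actual API. */
public class VersionGuardedRestartSketch {

    /** Hypothetical callback a restart strategy invokes instead of calling restart() directly. */
    public interface RestartCallback {
        void triggerFullRecovery();
    }

    // Tracks global modifications (such as full restarts) of the graph.
    private final AtomicLong globalModVersion = new AtomicLong(0);

    // Counts the restarts that actually took effect (for illustration only).
    private final AtomicInteger executedRestarts = new AtomicInteger(0);

    /** Captures the version at trigger time and binds it to the returned callback. */
    public RestartCallback newRestartCallback() {
        final long expectedVersion = globalModVersion.get();
        return () -> restart(expectedVersion);
    }

    /**
     * Executes the restart only if no other global change happened since the
     * restart was triggered. Returns false if the request was subsumed.
     */
    public boolean restart(long expectedVersion) {
        // Assumption of this sketch: check and bump the version in one atomic step.
        if (!globalModVersion.compareAndSet(expectedVersion, expectedVersion + 1)) {
            return false; // another global change won; this request is stale
        }
        executedRestarts.incrementAndGet();
        return true;
    }

    public int executedRestarts() {
        return executedRestarts.get();
    }

    public static void main(String[] args) {
        VersionGuardedRestartSketch graph = new VersionGuardedRestartSketch();

        // Two restart requests are triggered while the graph is at the same version.
        RestartCallback first = graph.newRestartCallback();
        RestartCallback second = graph.newRestartCallback();

        first.triggerFullRecovery();  // takes effect and bumps the version
        second.triggerFullRecovery(); // sees a changed version and does nothing

        System.out.println("executed restarts: " + graph.executedRestarts());
    }
}
```

    In this sketch, both callbacks capture version 0, so only the first one to 
run succeeds; the second finds the version changed and returns without 
restarting, which is the "subsumed" case from the change log.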
    
    ## Verifying this change
    
    This change adds the following tests:
      - `ExecutionGraphRestartTest#testConcurrentGlobalFailAndRestarts()` tests 
explicitly that setting
      - `ExecutionGraphRestartTest#testConcurrentLocalFailAndRestart()` tests a 
similar setup 
    
    The general working of that mechanism is also covered by various existing 
tests in `org.apache.flink.runtime.executiongraph.restart`.
    
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): **no**
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
      - The serializers: **no**
      - The runtime per-record code paths (performance sensitive): **no**
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes**:
    
    The change affects the restart logic on the `JobManager`.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? **no**
      - If yes, how is the feature documented? **not applicable**
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink concurrent_restarts_13

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4364
    
----
commit 1abb816d664bdac9d8b9af438769b9f685e768ce
Author: zjureel <[email protected]>
Date:   2017-07-18T17:27:56Z

    [FLINK-6665] [FLINK-6667] [distributed coordination] Use a callback and a 
ScheduledExecutor for ExecutionGraph restarts
    
    Initial work by [email protected] , improved by [email protected].

commit ef88524c808766e08d990f3bb69c45b04807c7c2
Author: Stephan Ewen <[email protected]>
Date:   2017-07-18T17:49:56Z

    [FLINK-7216] [distr. coordination] Guard against concurrent global failover

----


