[ 
https://issues.apache.org/jira/browse/FLINK-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252042#comment-15252042
 ] 

ASF GitHub Bot commented on FLINK-3800:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/1923

    [FLINK-3800] [jobmanager] Terminate ExecutionGraphs properly

    This PR terminates the ExecutionGraphs properly, without restarts, when the
    JobManager calls cancelAndClearEverything. This is achieved by allowing the
    method to be called only with a SuppressRestartsException. The
    SuppressRestartsException disables the restart strategy of the respective
    ExecutionGraph. This is important because the root cause could be a
    different exception. To avoid race conditions, the restart strategy has to
    be asked twice whether it allows the job to be restarted: once before and
    once after the job has transitioned to the state RESTARTING. This prevents
    ExecutionGraphs from becoming orphans.
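
As an illustration only (this is not code from the pull request), the double check described above might be sketched as follows. All type and method names here (SketchJobState, SketchRestartStrategy, SketchExecutionGraph, tryRestart) are hypothetical stand-ins and not Flink's API; the Runnable parameter exists only to simulate a concurrent suppression that arrives between the first check and the state transition.

```java
/** Hypothetical job states; only the ones this sketch needs. */
enum SketchJobState { FAILING, RESTARTING, FAILED }

/** Hypothetical per-job restart strategy that can be switched off. */
final class SketchRestartStrategy {
    private volatile boolean restartsSuppressed = false;

    boolean canRestart() {
        return !restartsSuppressed;
    }

    /** Models the effect of a failure that must not trigger a restart. */
    void suppressRestarts() {
        restartsSuppressed = true;
    }
}

/** Hypothetical graph that asks the strategy before and after the transition. */
final class SketchExecutionGraph {
    private final SketchRestartStrategy strategy;
    private SketchJobState state = SketchJobState.FAILING;

    SketchExecutionGraph(SketchRestartStrategy strategy) {
        this.strategy = strategy;
    }

    synchronized SketchJobState getState() {
        return state;
    }

    /** Moves from one state to another only if the current state matches. */
    private synchronized boolean transition(SketchJobState from, SketchJobState to) {
        if (state != from) {
            return false;
        }
        state = to;
        return true;
    }

    /** Returns true only if a restart is still allowed after entering RESTARTING. */
    boolean tryRestart(Runnable betweenCheckAndTransition) {
        // First check: do not even enter RESTARTING if restarts are off.
        if (!strategy.canRestart()) {
            transition(SketchJobState.FAILING, SketchJobState.FAILED);
            return false;
        }
        // Test hook: a concurrent caller may suppress restarts right here.
        betweenCheckAndTransition.run();
        if (!transition(SketchJobState.FAILING, SketchJobState.RESTARTING)) {
            return false;
        }
        // Second check: catches a suppression that raced with the transition.
        if (!strategy.canRestart()) {
            transition(SketchJobState.RESTARTING, SketchJobState.FAILED);
            return false;
        }
        return true; // a real implementation would schedule the restart here
    }
}
```

Without the second check, a suppression arriving in the marked window would leave the graph in RESTARTING with nobody left to fail it, which is the orphan scenario the description refers to.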
    
    Furthermore, this PR fixes the problem that the default restart strategy is
    shared by multiple jobs. The problem is solved by introducing a
    RestartStrategyFactory, which creates a separate RestartStrategy instance
    for every job.
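
To make the sharing problem concrete, here is a minimal sketch, assuming a restart strategy that counts attempts. The names CountingRestartStrategy, CountingRestartStrategyFactory, recordRestart and createRestartStrategy are hypothetical and chosen for illustration; the RestartStrategyFactory introduced by the PR may look different.

```java
/** Hypothetical stateful strategy: allows a fixed number of restarts. */
final class CountingRestartStrategy {
    private final int maxAttempts;
    private int attempts = 0;

    CountingRestartStrategy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    boolean canRestart() {
        return attempts < maxAttempts;
    }

    /** Counts one restart against this instance's budget. */
    void recordRestart() {
        attempts++;
    }
}

/** Hypothetical factory: holds only configuration, never per-job state. */
final class CountingRestartStrategyFactory {
    private final int maxAttempts;

    CountingRestartStrategyFactory(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Each call returns a fresh instance, so jobs do not share counters. */
    CountingRestartStrategy createRestartStrategy() {
        return new CountingRestartStrategy(maxAttempts);
    }
}
```

If a single strategy object were shared instead, a restart consumed by one job would reduce the budget of every other job; handing out a fresh instance per job keeps that state isolated.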
    
    - [X] General
      - The pull request references the related JIRA issue
      - The pull request addresses only one issue
      - Each commit in the PR has a meaningful commit message
    
    - [X] Tests & Build
      - Functionality added by the pull request is covered by tests
      - `mvn clean verify` has been executed successfully locally or a Travis build has passed


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixJobRestart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1923.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1923
    
----
commit ea05ae102428f6be8db4091b849b680112099c36
Author: Till Rohrmann <[email protected]>
Date:   2016-04-21T15:07:51Z

    [FLINK-3800] [jobmanager] Terminate ExecutionGraphs properly
    
    This PR terminates the ExecutionGraphs properly, without restarts, when the
    JobManager calls cancelAndClearEverything. This is achieved by allowing the
    method to be called only with a SuppressRestartsException. The
    SuppressRestartsException disables the restart strategy of the respective
    ExecutionGraph. This is important because the root cause could be a
    different exception. To avoid race conditions, the restart strategy has to
    be asked twice whether it allows the job to be restarted: once before and
    once after the job has transitioned to the state RESTARTING. This prevents
    ExecutionGraphs from becoming orphans.

    Furthermore, this PR fixes the problem that the default restart strategy is
    shared by multiple jobs. The problem is solved by introducing a
    RestartStrategyFactory, which creates a separate RestartStrategy instance
    for every job.

----


> ExecutionGraphs can become orphans
> ----------------------------------
>
>                 Key: FLINK-3800
>                 URL: https://issues.apache.org/jira/browse/FLINK-3800
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> The {{JobManager.cancelAndClearEverything}} method fails all jobs currently
> executing on the {{JobManager}} and then clears the list of
> {{currentJobs}} kept in the JobManager. This can become problematic if the
> user has set a restart strategy for a job, because the {{RestartStrategy}}
> will try to restart the job. This can lead to unwanted re-deployments of the
> job, which consume resources and thus disturb the execution of other
> jobs. If the restart strategy never stops, this prevents the
> {{ExecutionGraph}} from ever being properly terminated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
