[ 
https://issues.apache.org/jira/browse/FLINK-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344181#comment-15344181
 ] 

ASF GitHub Bot commented on FLINK-3800:
---------------------------------------

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2096#discussion_r68039992
  
    --- Diff: docs/internals/job_scheduling.md ---
    @@ -74,7 +74,28 @@ Besides the vertices, the ExecutionGraph also contains 
the {% gh_link /flink-run
     <img src="fig/job_and_execution_graph.svg" alt="JobGraph and 
ExecutionGraph" height="400px" style="text-align: center;"/>
     </div>
     
    -During its execution, each parallel task goes through multiple stages, 
from *created* to *finished* or *failed*. The diagram below illustrates the 
    +Each ExecutionGraph has a job status associated with it.
    +This job status indicates the current state of the job execution.
    +
    +A Flink job is first in the *created* state, then switches to *running* 
and upon completion of all work it switches to *finished*.
    +In case of failures, a job switches first to *failing* where it cancels 
all running tasks.
    +If all job vertices have reached a final state and the job is not 
restartable, then the job transitions to *failed*.
    +If the job can be restarted, then it will enter the *restarting* state.
    +Once the job has been completely restarted, it will reach the *created* 
state.
    +
    +In case that the user cancels the job, it will go into the *cancelling* 
state.
    +This is also entails the cancellation of all currently running tasks.
    +Once all running tasks have reached a final state, the job transitions to 
the state *cancelled*.
    +
    +Unlike the states *finished*, *canceled* and *failed* which denote a 
globally terminal state and, thus, trigger the clean up of the job, the 
*suspended* state is only locally terminal.
    +Locally terminal means that the execution of the job has been terminated 
on the respective JobManager but another JobManager of the Flink cluster can 
retrieve the job from the persistent HA store and restart it.
    +Consequently, a job which reaches the *suspended* state won't be 
completely cleaned up.
    +
    +<div style="text-align: center;">
    +<img src="fig/job_status.svg" alt="States and Transitions of Flink job" 
height="500px" style="text-align: center;"/>
    --- End diff --
    
    Very nice figure! This will help future contributors a lot.
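
The transitions described in the quoted diff can also be written down as a small state machine. The following is only a sketch under assumptions: the enum `JobStatusSketch`, its nested `Terminal` enum, and all method names are hypothetical and are not taken from the Flink code base, whose own job status class may differ. It encodes only the transitions spelled out in the text. Cancellation is assumed to start from *running*, and transitions into *suspended* are left out because the text does not state them.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the job states described in the diff; not Flink's actual class. */
enum JobStatusSketch {
    CREATED(Terminal.NO),
    RUNNING(Terminal.NO),
    FAILING(Terminal.NO),
    FAILED(Terminal.GLOBALLY),
    CANCELLING(Terminal.NO),
    CANCELED(Terminal.GLOBALLY),
    FINISHED(Terminal.GLOBALLY),
    RESTARTING(Terminal.NO),
    SUSPENDED(Terminal.LOCALLY);

    /** Whether a state is terminal, and if so, on which scope. */
    enum Terminal { NO, LOCALLY, GLOBALLY }

    private final Terminal terminal;

    JobStatusSketch(Terminal terminal) {
        this.terminal = terminal;
    }

    /** Globally terminal states trigger the clean up of the job. */
    boolean isGloballyTerminal() {
        return terminal == Terminal.GLOBALLY;
    }

    /** Terminal on this JobManager; another JobManager may still recover a suspended job. */
    boolean isTerminal() {
        return terminal != Terminal.NO;
    }

    /** Successor states, restricted to the transitions named in the text. */
    Set<JobStatusSketch> successors() {
        switch (this) {
            case CREATED:
                return EnumSet.of(RUNNING);
            case RUNNING:
                // Assumption: a user cancel is modeled from RUNNING only.
                return EnumSet.of(FINISHED, FAILING, CANCELLING);
            case FAILING:
                // FAILED if the job is not restartable, RESTARTING otherwise.
                return EnumSet.of(FAILED, RESTARTING);
            case RESTARTING:
                return EnumSet.of(CREATED);
            case CANCELLING:
                return EnumSet.of(CANCELED);
            default:
                // Terminal states have no successors in this sketch.
                return EnumSet.noneOf(JobStatusSketch.class);
        }
    }

    boolean canTransitionTo(JobStatusSketch next) {
        return successors().contains(next);
    }

    public static void main(String[] args) {
        // Print every state with its terminal flags and successors.
        for (JobStatusSketch s : values()) {
            System.out.println(s + " terminal=" + s.isTerminal()
                    + " globallyTerminal=" + s.isGloballyTerminal()
                    + " -> " + s.successors());
        }
    }
}
```

In this sketch the distinction from the last paragraph of the diff shows up as `SUSPENDED.isTerminal()` being true while `SUSPENDED.isGloballyTerminal()` is false, so a caller that cleans up only on globally terminal states would leave a suspended job recoverable.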


> ExecutionGraphs can become orphans
> ----------------------------------
>
>                 Key: FLINK-3800
>                 URL: https://issues.apache.org/jira/browse/FLINK-3800
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> The {{JobManager.cancelAndClearEverything}} method fails all currently 
> executed jobs on the {{JobManager}} and then clears the list of 
> {{currentJobs}} kept in the JobManager. This can become problematic if the 
> user has set a restart strategy for a job, because the {{RestartStrategy}} 
> will try to restart the job. This can lead to unwanted re-deployments of the 
> job, which consume resources and thus disturb the execution of other 
> jobs. If the restart strategy never stops, this prevents the 
> {{ExecutionGraph}} from ever being properly terminated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
