[jira] [Commented] (FLINK-4046) Failing a restarting job can get stuck in JobStatus.FAILING

ASF GitHub Bot (JIRA) Mon, 13 Jun 2016 02:44:53 -0700

    [ https://issues.apache.org/jira/browse/FLINK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327080#comment-15327080 ]


ASF GitHub Bot commented on FLINK-4046:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2095

    [FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED

    A job can get stuck in FAILING if fail is called on a restarting job which has
    not yet reset its ExecutionJobVertices, because these vertices would not call
    jobVertexInFinalState. This method, however, must be called in order to transition
    from FAILING to FAILED. To solve the problem, this PR introduces a direct
    state transition from `RESTARTING` to `FAILED`, which is taken if `fail` is called
    while the job is in state `RESTARTING`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixFailWhileRestarting

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2095
    
----
commit 094c6b59eb92cb5a0f3bf41aa92aab399ba4127c
Author: Till Rohrmann <[email protected]>
Date:   2016-06-09T13:54:01Z

    [FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED
    
    A job can get stuck in FAILING if fail is called on a restarting job which has
    not yet reset its ExecutionJobVertices, because these vertices would not call
    jobVertexInFinalState. This method, however, must be called in order to transition
    from FAILING to FAILED.

----


> Failing a restarting job can get stuck in JobStatus.FAILING
> -----------------------------------------------------------
>
>                 Key: FLINK-4046
>                 URL: https://issues.apache.org/jira/browse/FLINK-4046
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 1.1.0
>            Reporter: Till Rohrmann
>             Fix For: 1.1.0
>
>
> When a job is in state {{RESTARTING}}, it can happen that all of its 
> {{ExecutionJobVertices}} are in a final state (if they have not been reset). 
> Calling {{fail}} on this {{ExecutionGraph}} will transition the state to 
> {{FAILING}} and call cancel on all {{ExecutionJobVertices}}. The job state 
> {{FAILING}} can only be left once all {{ExecutionJobVertices}} have reached a 
> final state. The notification of this final state is only sent to the 
> {{ExecutionGraph}} when all subtasks of an {{ExecutionJobVertex}} have 
> transitioned to a final state. However, this won't happen because the 
> {{ExecutionJobVertices}} are already in a final state. The result is that a 
> job can get stuck in the state {{FAILING}} if {{fail}} is called on a 
> {{RESTARTING}} job.
> I propose to add a direct transition from {{RESTARTING}} to {{FAILED}}, as is 
> already the case for the {{cancel}} call (transition from {{RESTARTING}} to 
> {{CANCELED}}).
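
The proposed transition can be illustrated with a minimal sketch. This is not Flink's actual ExecutionGraph code: the class JobStateSketch, its Status enum, and the method onAllVerticesFinal are hypothetical names chosen for illustration, and the real implementation tracks per-vertex state that is omitted here. The sketch only assumes that the job status is updated atomically and that FAILING is normally left through a notification from the vertices.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified model of the job status transitions (not Flink's code). */
final class JobStateSketch {

    enum Status { RUNNING, RESTARTING, FAILING, FAILED, CANCELED }

    private final AtomicReference<Status> status;

    JobStateSketch(Status initial) {
        this.status = new AtomicReference<>(initial);
    }

    Status status() {
        return status.get();
    }

    /** Fails the job; a RESTARTING job goes straight to FAILED. */
    void fail() {
        while (true) {
            Status current = status.get();
            if (current == Status.FAILING
                    || current == Status.FAILED
                    || current == Status.CANCELED) {
                return; // already failing or in a terminal state
            }
            if (current == Status.RESTARTING) {
                // The vertices may already be in a final state and will not
                // send another notification, so FAILING is skipped entirely.
                if (status.compareAndSet(current, Status.FAILED)) {
                    return;
                }
            } else if (status.compareAndSet(current, Status.FAILING)) {
                // Here the vertices would be cancelled; FAILED is reached
                // later through onAllVerticesFinal().
                return;
            }
            // The compare-and-set lost a race with another thread; retry.
        }
    }

    /** Stands in for the notification that all vertices reached a final state. */
    void onAllVerticesFinal() {
        status.compareAndSet(Status.FAILING, Status.FAILED);
    }
}
```

In this model, calling fail() on a RESTARTING job ends in FAILED without waiting for a notification, while a RUNNING job still passes through FAILING until onAllVerticesFinal() is called.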



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
