[
https://issues.apache.org/jira/browse/FLINK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327080#comment-15327080
]
ASF GitHub Bot commented on FLINK-4046:
---------------------------------------
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2095
[FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED
A job can get stuck in FAILING if fail is called on a restarting job which
has not yet reset its ExecutionJobVertices, because these vertices would not
call jobVertexInFinalState. This method, however, must be called in order to
transition from FAILING to FAILED. To solve the problem, this PR introduces a
direct state transition from `RESTARTING` to `FAILED`, which is taken if
`fail` is called while the job is in state `RESTARTING`.
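The following is a minimal sketch of the transition described above, not the code in this PR. `ToyExecutionGraph`, its `directTransition` flag, and the single-vertex simplification are assumptions made for illustration; only the state names and the `fail` / `jobVertexInFinalState` method names come from the description.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: this is NOT Flink's ExecutionGraph.
enum JobStatus { RUNNING, FAILING, FAILED, RESTARTING, CANCELED }

final class ToyExecutionGraph {
    // The job starts in RESTARTING with vertices that have not been reset,
    // so no vertex will report a final state again.
    private final AtomicReference<JobStatus> state =
            new AtomicReference<>(JobStatus.RESTARTING);

    // Hypothetical switch: false models the old behavior, true the proposed one.
    private final boolean directTransition;

    ToyExecutionGraph(boolean directTransition) {
        this.directTransition = directTransition;
    }

    JobStatus getState() {
        return state.get();
    }

    void fail() {
        while (true) {
            JobStatus current = state.get();
            if (current == JobStatus.FAILING
                    || current == JobStatus.FAILED
                    || current == JobStatus.CANCELED) {
                return; // already failing or in a terminal state
            }
            if (directTransition && current == JobStatus.RESTARTING) {
                // Proposed: skip FAILING, because no vertex callback will arrive.
                if (state.compareAndSet(current, JobStatus.FAILED)) {
                    return;
                }
            } else if (state.compareAndSet(current, JobStatus.FAILING)) {
                // Old path: leaving FAILING depends on jobVertexInFinalState().
                return;
            }
        }
    }

    // Invoked by a vertex once all of its subtasks reached a final state.
    // In this toy model it is the only way out of FAILING.
    void jobVertexInFinalState() {
        state.compareAndSet(JobStatus.FAILING, JobStatus.FAILED);
    }

    public static void main(String[] args) {
        ToyExecutionGraph before = new ToyExecutionGraph(false);
        before.fail();
        System.out.println(before.getState()); // FAILING, and nothing moves it on

        ToyExecutionGraph after = new ToyExecutionGraph(true);
        after.fail();
        System.out.println(after.getState()); // FAILED
    }
}
```

In the first case the job only leaves `FAILING` if some vertex later calls `jobVertexInFinalState`, which is exactly the call that never comes for vertices that are already final.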
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink fixFailWhileRestarting
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2095.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2095
----
commit 094c6b59eb92cb5a0f3bf41aa92aab399ba4127c
Author: Till Rohrmann
Date: 2016-06-09T13:54:01Z
[FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED
A job can get stuck in FAILING if fail is called on a restarting job which
has not yet reset its ExecutionJobVertices, because these vertices would not
call jobVertexInFinalState. This method, however, must be called in order to
transition from FAILING to FAILED.
----
> Failing a restarting job can get stuck in JobStatus.FAILING
> -----------------------------------------------------------
>
> Key: FLINK-4046
> URL: https://issues.apache.org/jira/browse/FLINK-4046
> Project: Flink
> Issue Type: Bug
> Components: Distributed Runtime
> Affects Versions: 1.1.0
> Reporter: Till Rohrmann
> Fix For: 1.1.0
>
>
> When a job is in state {{RESTARTING}}, all of its {{ExecutionJobVertices}}
> can already be in a final state (if they have not yet been reset). Calling
> {{fail}} on this {{ExecutionGraph}} will then transition the state to
> {{FAILING}} and call cancel on all {{ExecutionJobVertices}}. The job state
> {{FAILING}} can only be left once all {{ExecutionJobVertices}} have reached a
> final state. The notification of this final state is only sent to the
> {{ExecutionGraph}} when all subtasks of an {{ExecutionJobVertex}} have
> transitioned to a final state. However, this won't happen because the
> {{ExecutionJobVertices}} are already in a final state. The result is that a
> job can get stuck in the state {{FAILING}} if {{fail}} is called on a
> {{RESTARTING}} job.
> I propose to add a direct transition from {{RESTARTING}} to {{FAILED}}, as
> is already the case for the {{cancel}} call (transition from {{RESTARTING}}
> to {{CANCELED}}).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)