[
https://issues.apache.org/jira/browse/FLINK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346310#comment-15346310
]
ASF GitHub Bot commented on FLINK-4046:
---------------------------------------
Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2095
Thanks for the review @uce. I think you're right: we should also accept
the `FAILED` state when calling `restart` and simply print a log message
instead of throwing the `IllegalStateException`. I'm less sure about doing
this for every terminal state, though, because we should never have reached
the state `FINISHED` at that point. That usually indicates a failure and
should not go unnoticed.
I will add the `FAILED` check and then merge the PR.
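A minimal sketch of the check described above, not the actual Flink code: `RestartSketch`, its `log` list, and the reduced `JobStatus` enum are hypothetical stand-ins for the real `ExecutionGraph` and its logger. It only illustrates tolerating `FAILED` in `restart` while still rejecting other unexpected states such as `FINISHED`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the ExecutionGraph; not Flink's class.
class RestartSketch {
    enum JobStatus { RUNNING, RESTARTING, FAILING, FAILED, FINISHED }

    private JobStatus state;
    // Stand-in for a real logger so the sketch stays self-contained.
    final List<String> log = new ArrayList<>();

    RestartSketch(JobStatus initial) {
        this.state = initial;
    }

    JobStatus getState() {
        return state;
    }

    void restart() {
        if (state == JobStatus.FAILED) {
            // The job was failed while the restart was pending: log and return.
            log.add("Job is already in state FAILED; skipping restart.");
            return;
        }
        if (state != JobStatus.RESTARTING) {
            // Any other state (for example FINISHED) still signals a real problem.
            throw new IllegalStateException(
                "Can only restart a job from state RESTARTING, but state was " + state);
        }
        // Assumption: the real graph would reset its vertices before running again.
        state = JobStatus.RUNNING;
    }
}
```

With this shape, a `restart` call that races with a concurrent `fail` becomes a logged no-op, whereas reaching `FINISHED` still surfaces as an exception.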
> Failing a restarting job can get stuck in JobStatus.FAILING
> -----------------------------------------------------------
>
> Key: FLINK-4046
> URL: https://issues.apache.org/jira/browse/FLINK-4046
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.1.0
> Reporter: Till Rohrmann
> Fix For: 1.1.0
>
>
> When a job is in state {{RESTARTING}}, all of its
> {{ExecutionJobVertices}} can be in a final state (if they have not been reset).
> Calling {{fail}} on this {{ExecutionGraph}} will then transition the state to
> {{FAILING}} and call cancel on all {{ExecutionJobVertices}}. The job state
> {{FAILING}} can only be left once all {{ExecutionJobVertices}} have reached a
> final state. The notification of this final state is only sent to the
> {{ExecutionGraph}} when all subtasks of an {{ExecutionJobVertex}} have
> transitioned to a final state. However, this won't happen because the
> {{ExecutionJobVertices}} are already in a final state. The result is that a
> job can get stuck in the state {{FAILING}} if {{fail}} is called on a
> {{RESTARTING}} job.
> I propose to add a direct transition from {{RESTARTING}} to {{FAILED}}, as
> is already the case for the {{cancel}} call (transition from {{RESTARTING}} to
> {{CANCELED}}).
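
The transition proposed in the quoted description can be sketched roughly as follows. This is an illustration under stated assumptions, not Flink's implementation: `FailSketch`, `pendingVertices`, and `vertexReachedFinalState` are hypothetical names that model only the bookkeeping of final-state notifications.

```java
// Hypothetical, simplified model of the job state machine; not Flink's class.
class FailSketch {
    enum JobStatus { RUNNING, RESTARTING, FAILING, FAILED }

    private JobStatus state;
    // Number of job vertices that have not yet reported a final state.
    private int pendingVertices;

    FailSketch(JobStatus initial, int pendingVertices) {
        this.state = initial;
        this.pendingVertices = pendingVertices;
    }

    JobStatus getState() {
        return state;
    }

    void fail() {
        if (state == JobStatus.RESTARTING) {
            // All vertices are already final, so no further notification will
            // arrive. Go to FAILED directly instead of waiting in FAILING.
            state = JobStatus.FAILED;
        } else if (state == JobStatus.RUNNING) {
            // Assumption: the real graph would also cancel all vertices here.
            state = JobStatus.FAILING;
        }
    }

    // Called when one job vertex reports that all of its subtasks are final.
    void vertexReachedFinalState() {
        if (pendingVertices > 0) {
            pendingVertices--;
        }
        if (pendingVertices == 0 && state == JobStatus.FAILING) {
            state = JobStatus.FAILED;
        }
    }
}
```

In this model, `fail` on a `RUNNING` job still passes through `FAILING` and leaves it when the last vertex reports in, while `fail` on a `RESTARTING` job never enters `FAILING` and therefore cannot get stuck there.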