[
https://issues.apache.org/jira/browse/FLINK-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155982#comment-15155982
]
ASF GitHub Bot commented on FLINK-3443:
---------------------------------------
GitHub user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/1669#discussion_r53564976
--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
         }
       }
    -  eg.fail(cause)
    +  eg.cancel()
--- End diff --
What if we let the job fail with an UnrecoverableException upon JobManager
termination?
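
To make the suggestion concrete, here is a minimal Scala sketch of how an unrecoverable failure could bypass the restart strategy while still recording a failure rather than a cancellation. It is a toy model, not Flink code: `SketchGraph`, `JobState`, the `restartsLeft` counter, and this particular `UnrecoverableException` class are hypothetical names invented for illustration, and the real ExecutionGraph state machine is more involved.

```scala
// Toy model only: none of these types are Flink classes.
sealed trait JobState
case object Running extends JobState
case object Restarting extends JobState
case object Failed extends JobState
case object Canceled extends JobState

// Hypothetical marker for failures that must never trigger a restart.
class UnrecoverableException(msg: String) extends Exception(msg)

// Hypothetical stand-in for an execution graph with a restart budget.
class SketchGraph(private var restartsLeft: Int) {
  private var current: JobState = Running

  def state: JobState = current

  def fail(cause: Throwable): Unit = current match {
    case Failed | Canceled => () // already terminal, ignore the call
    case _ =>
      cause match {
        // Unrecoverable failures skip the restart budget entirely.
        case _: UnrecoverableException => current = Failed
        case _ if restartsLeft > 0 =>
          restartsLeft -= 1
          current = Restarting
        case _ => current = Failed
      }
  }

  def cancel(): Unit = current match {
    case Failed | Canceled => () // already terminal, ignore the call
    case _ => current = Canceled
  }
}

object ShutdownSketch {
  def main(args: Array[String]): Unit = {
    val ordinary = new SketchGraph(3)
    ordinary.fail(new RuntimeException("task lost"))
    println(ordinary.state)

    val shutdown = new SketchGraph(3)
    shutdown.fail(new UnrecoverableException("JobManager terminated"))
    println(shutdown.state)
  }
}
```

In this sketch an ordinary exception consumes one restart attempt and leaves the graph restarting, whereas the unrecoverable one drives it straight to a terminal failed state. That would keep the failure cause visible without risking a restart during shutdown.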
On Feb 20, 2016 12:08 AM, "Ufuk Celebi" wrote:
> In flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
> <https://github.com/apache/flink/pull/1669#discussion_r53531860>:
>
> > @@ -1487,7 +1487,7 @@ class JobManager(
> > }
> > }
> >
> > - eg.fail(cause)
> > + eg.cancel()
>
> Good point with the TM logs.
>
> My main reason was that calls to fail (for example, from
> cancelAndClearEverything during shutdown or from a shutdown of the
> InstanceManager) can lead to the execution graph being restarted even
> though the job manager is shut down. The cancel call ensures that this does
> not happen and that the execution graph eventually enters a terminal state.
>
> The main thing that triggered this change was the following: when you
> start a test cluster and shut it down while a job with a restart strategy
> is running and you *don't* immediately kill the process and have logging
> enabled, you see that the ExecutionGraph is still attempting to recover
> the job.
>
> What I don't understand is how this even happens when we shut down the
> ExecutorService. Any idea?
>
> Do you think there is another way to prevent this behaviour? I would be
> happy to keep the failure cause as before, but couldn't think of any other
> way.
> ------------------------------
>
> This has been changed as well: a fail will be ignored when the job is
> cancelling or cancelled. That's OK, right?
>
> JobManager cancel and clear everything fails jobs instead of cancelling
> -----------------------------------------------------------------------
>
> Key: FLINK-3443
> URL: https://issues.apache.org/jira/browse/FLINK-3443
> Project: Flink
> Issue Type: Bug
> Components: Distributed Runtime
> Reporter: Ufuk Celebi
> Assignee: Ufuk Celebi
>
> When the job manager is shut down, it calls {{cancelAndClearEverything}}.
> This method does not {{cancel}} the {{ExecutionGraph}} instances, but
> {{fail}}s them, which can lead to {{ExecutionGraph}} restart.
> I've noticed this in tests, where an old graph got into a loop of restarts.
> What I don't understand is why the futures etc. are not cancelled when the
> executor service is shut down.