[
https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148478#comment-15148478
]
ASF GitHub Bot commented on FLINK-3396:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/1633#discussion_r52998567
--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
@@ -1073,57 +1073,73 @@ class JobManager(
// execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously
// because it is a blocking operation
future {
- try {
- if (isRecovery) {
- executionGraph.restoreLatestCheckpointedState()
- }
- else {
- val snapshotSettings = jobGraph.getSnapshotSettings
- if (snapshotSettings != null) {
- val savepointPath = snapshotSettings.getSavepointPath()
+ val restoreStateSuccess =
+ try {
+ if (isRecovery) {
+ executionGraph.restoreLatestCheckpointedState()
--- End diff --
Right now, a failure during job recovery simply fails the `ExecutionGraph`,
which triggers a restart. A successful job recovery sends a `JobSubmitSuccess`
to the client. I'm not sure whether this is actually correct, since the client
already received a `JobSubmitMessage` from the `JobManager` when it initially
submitted the job. But I think this message will simply be ignored.
Thus, suppressing the restart in case of a job recovery would actually change
the current behaviour.
If it is possible to recover from failures that occur while recovering a job or
restoring a savepoint, it would make sense to restart the job instead of failing
it directly. Maybe one should distinguish the two cases based on the exception
that actually occurred.
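To make the last point concrete, here is a minimal sketch of what distinguishing by exception type could look like. It is not the code in this pull request and uses no Flink classes: `RestoreOutcome`, `Restored`, `RestartJob`, `FailJob` and `attemptRestore` are hypothetical names, and treating `IOException` as recoverable while every other non-fatal exception fails the job is only an assumption for illustration.

```scala
import java.io.IOException
import scala.util.control.NonFatal

// Hypothetical outcome type for a restore attempt; not part of Flink.
sealed trait RestoreOutcome
case object Restored extends RestoreOutcome
final case class RestartJob(cause: Throwable) extends RestoreOutcome
final case class FailJob(cause: Throwable) extends RestoreOutcome

object RestoreSketch {
  // Runs the given restore action and classifies the result.
  // Assumption: I/O problems are transient, so the job may be restarted;
  // any other non-fatal exception is treated as a permanent failure.
  def attemptRestore(restore: () => Unit): RestoreOutcome =
    try {
      restore()
      Restored
    } catch {
      case e: IOException => RestartJob(e)
      case NonFatal(e)    => FailJob(e)
    }
}
```

With something like this, the caller could always ack the submission (success for `Restored`, failure for `FailJob`) and only trigger the restart path for `RestartJob`.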
> Job submission Savepoint restore logic flawed
> ---------------------------------------------
>
> Key: FLINK-3396
> URL: https://issues.apache.org/jira/browse/FLINK-3396
> Project: Flink
> Issue Type: Bug
> Reporter: Ufuk Celebi
> Assignee: Ufuk Celebi
> Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or
> failure in any case. With savepoint restore failures, the ack message will be
> skipped.