[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed

ASF GitHub Bot (JIRA) Tue, 16 Feb 2016 03:27:40 -0800

    [ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148478#comment-15148478 ]


ASF GitHub Bot commented on FLINK-3396:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1633#discussion_r52998567
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1073,57 +1073,73 @@ class JobManager(
           // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously
           // because it is a blocking operation
           future {
    -        try {
    -          if (isRecovery) {
    -            executionGraph.restoreLatestCheckpointedState()
    -          }
    -          else {
    -            val snapshotSettings = jobGraph.getSnapshotSettings
    -            if (snapshotSettings != null) {
    -              val savepointPath = snapshotSettings.getSavepointPath()
    +        val restoreStateSuccess =
    +          try {
    +            if (isRecovery) {
    +              executionGraph.restoreLatestCheckpointedState()
    --- End diff --
    
    Currently, a failure during job recovery simply fails the
    `ExecutionGraph`, which triggers a restart. A successful job recovery sends
    a `JobSubmitSuccess` to the client. I'm not sure whether this is actually
    correct, since the client already received a `JobSubmitMessage` from the
    `JobManager` when it initially submitted the job. But I think the second
    message will simply be ignored.
    
    Thus, suppressing the restart in case of a job recovery would change the
    existing behaviour.
    
    If it is possible to recover from failures that occur while recovering a
    job or restoring a savepoint, it would make sense not to fail the job
    directly without restarting. Maybe one should distinguish the cases based
    on the exception that actually occurred.
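    
    A minimal sketch of what distinguishing by exception could look like.
    Everything here is hypothetical: `RestoreFailureAction`, `RestartJob`,
    `FailJobPermanently` and `RestoreFailurePolicy` are not Flink types, and
    treating `IOException` as the only transient failure is an assumption
    made purely for illustration.

```scala
import java.io.IOException
import scala.util.control.NonFatal

// Hypothetical outcome type, not part of Flink's API.
sealed trait RestoreFailureAction
case object RestartJob extends RestoreFailureAction
case object FailJobPermanently extends RestoreFailureAction

object RestoreFailurePolicy {

  // Assumption: I/O problems are considered transient and worth a restart;
  // every other failure is treated as permanent.
  def classify(cause: Throwable): RestoreFailureAction = cause match {
    case _: IOException => RestartJob
    case _              => FailJobPermanently
  }

  // Runs the given restore step and maps a non-fatal failure to an action.
  // Returns Right(()) when the restore step completes without throwing.
  def attemptRestore(restore: () => Unit): Either[RestoreFailureAction, Unit] =
    try {
      restore()
      Right(())
    } catch {
      case NonFatal(t) => Left(classify(t))
    }
}
```

    The caller could then ack the submission with success or failure in
    either case and only trigger the restart logic for `RestartJob`.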


> Job submission Savepoint restore logic flawed
> ---------------------------------------------
>
>                 Key: FLINK-3396
>                 URL: https://issues.apache.org/jira/browse/FLINK-3396
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>             Fix For: 1.0.0
>
>
> When savepoint restoring fails, the thrown Exception fails the execution 
> graph, but the client is not informed about the failure.
> The expected behaviour is that the submission should be acked with success or 
> failure in any case. With savepoint restore failures, the ack message will be 
> skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
