[ https://issues.apache.org/jira/browse/FLINK-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157487#comment-15157487 ]

ASF GitHub Bot commented on FLINK-3443:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53673891
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
               }
             }
     
    -        eg.fail(cause)
    +        eg.cancel()
    --- End diff --
    
    It is important to make sure that HA recovery continues.
    `cancelAndClearEverything()` is also triggered in `postStop()` if the actor
    or actor system crashes (one of the JobManager failure scenarios; the
    process reaper then kills the process).
    
    We hence need a kind of fail that is not retried locally, but that also
    does not lead to a terminal state, where we run the risk that some cleanup
    code removes the job from ZooKeeper.
    When the job is canceled, don't we run the risk that some code removes the
    ZooKeeper entries and prevents HA recovery?
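
The distinction asked for above can be illustrated with a minimal sketch. It is a simplified model, not Flink's actual `ExecutionGraph`: the `suspend()` operation, the `SUSPENDED` state, and the names `JobState`, `RecoveryStore` and `SketchGraph` are all hypothetical, and `RecoveryStore` merely stands in for the ZooKeeper entries. The sketch assumes that cleanup runs only on globally terminal states.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical job states; only CANCELED and FAILED count as globally terminal here. */
enum JobState {
    RUNNING(false),
    RESTARTING(false),
    SUSPENDED(false),
    CANCELED(true),
    FAILED(true);

    private final boolean globallyTerminal;

    JobState(boolean globallyTerminal) {
        this.globallyTerminal = globallyTerminal;
    }

    boolean isGloballyTerminal() {
        return globallyTerminal;
    }
}

/** Stand-in for the HA store (the ZooKeeper entries in the discussion). */
class RecoveryStore {
    private final Set<String> jobIds = new HashSet<>();

    void put(String jobId) { jobIds.add(jobId); }
    void remove(String jobId) { jobIds.remove(jobId); }
    boolean contains(String jobId) { return jobIds.contains(jobId); }
}

/** Simplified graph that shows how fail, cancel and suspend differ. */
class SketchGraph {
    private final String jobId;
    private final RecoveryStore store;
    private int restartsLeft;
    private JobState state = JobState.RUNNING;

    SketchGraph(String jobId, RecoveryStore store, int restartsLeft) {
        this.jobId = jobId;
        this.store = store;
        this.restartsLeft = restartsLeft;
        store.put(jobId); // the job is recoverable from the moment it is registered
    }

    JobState state() { return state; }

    /** Retried locally while restart attempts remain; otherwise globally terminal. */
    void fail() {
        if (isDone()) return;
        if (restartsLeft > 0) {
            restartsLeft--;
            state = JobState.RESTARTING;
        } else {
            transitionTo(JobState.FAILED);
        }
    }

    /** Globally terminal: the cleanup removes the recovery entry. */
    void cancel() {
        if (!isDone()) transitionTo(JobState.CANCELED);
    }

    /** Stops locally without retry, but keeps the recovery entry. */
    void suspend() {
        if (!isDone()) transitionTo(JobState.SUSPENDED);
    }

    private boolean isDone() {
        return state.isGloballyTerminal() || state == JobState.SUSPENDED;
    }

    private void transitionTo(JobState next) {
        state = next;
        // Assumption: cleanup code runs only for globally terminal states.
        if (next.isGloballyTerminal()) store.remove(jobId);
    }
}
```

In this model, `fail()` on shutdown can loop through restarts, `cancel()` drops the recovery entry, and only the hypothetical `suspend()` both stops the graph and leaves the entry in place for another JobManager to pick up.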


> JobManager cancel and clear everything fails jobs instead of cancelling
> -----------------------------------------------------------------------
>
>                 Key: FLINK-3443
>                 URL: https://issues.apache.org/jira/browse/FLINK-3443
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> When the job manager is shut down, it calls {{cancelAndClearEverything}}. 
> This method does not {{cancel}} the {{ExecutionGraph}} instances, but 
> {{fail}}s them, which can lead to an {{ExecutionGraph}} restart.
> I've noticed this in tests, where an old graph got into a loop of restarts.
> What I don't understand is why the futures etc. are not cancelled when the 
> executor service is shut down.


