[jira] [Comment Edited] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

Thomas Wozniakowski (JIRA) Tue, 21 Aug 2018 00:53:33 -0700


    [ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587119#comment-16587119 ]


Thomas Wozniakowski edited comment on FLINK-10184 at 8/21/18 7:52 AM:
----------------------------------------------------------------------

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove}}


was (Author: jamalarm):
Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be 
{{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}}
or
{{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove(java.lang.String, org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.RemoveCallback<T>)}}

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that if you start a job, cancel it, restart it, cancel it, and so 
> on, you will end up with many job graphs stored in Zookeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there, so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the Zookeeper jobgraphs 
> cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.
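
The mismatch described above (job graph entries left in Zookeeper with no matching blobs in the HA directory) can be checked for before a failover happens. What follows is only a minimal sketch, assuming the two lists of job IDs have already been collected by hand. The class and method names are hypothetical and are not part of Flink.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical diagnostic helper; not part of Flink. */
public final class OrphanedJobGraphs {

    private OrphanedJobGraphs() {}

    /**
     * Returns the job IDs that still have a job graph entry in Zookeeper
     * but no corresponding blob in the HA directory.
     *
     * Both inputs are assumed to be gathered manually by the caller.
     */
    public static Set<String> findOrphans(
            Collection<String> jobGraphIdsInZookeeper,
            Collection<String> jobIdsWithBlobs) {
        // Start from everything Zookeeper still lists ...
        Set<String> orphans = new TreeSet<>(jobGraphIdsInZookeeper);
        // ... and drop every job whose blobs are still present.
        orphans.removeAll(jobIdsWithBlobs);
        return orphans;
    }
}
```

Passing the children of the {{jobgraphs}} node as the first argument and the job IDs that still have blobs as the second returns the entries that a newly elected leader would fail to recover, i.e. the ones that would otherwise have to be cleared out by hand.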



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
