Thomas Wozniakowski created FLINK-10184:
-------------------------------------------

             Summary: HA Failover broken due to JobGraphs not being removed 
from Zookeeper on cancel
                 Key: FLINK-10184
                 URL: https://issues.apache.org/jira/browse/FLINK-10184
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.5.2
            Reporter: Thomas Wozniakowski


We have encountered a blocking issue when upgrading our cluster to 1.5.2.

It appears that, when jobs are cancelled manually (in our case with a 
savepoint), the JobGraphs are NOT removed from the ZooKeeper {{jobgraphs}} node.

This means that if you start a job, cancel it, restart it, cancel it, and so 
on, you will end up with many job graphs stored in ZooKeeper, but none of the 
corresponding blobs in the Flink HA directory.

When an HA failover occurs, the newly elected leader retrieves all of those old 
JobGraph objects from ZooKeeper, then goes looking for the corresponding blobs 
in the HA directory. The blobs are not there, so the JobManager explodes and the 
process dies.
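
To make the mismatch concrete, here is a minimal sketch (not Flink code) of the 
check that would flag the problematic entries. The function name 
{{find_orphaned_job_graphs}} and both inputs are hypothetical: it assumes you 
have already collected the job IDs listed under the ZooKeeper {{jobgraphs}} node 
and the job IDs that still have blobs in the HA directory by whatever means you 
prefer.

```python
def find_orphaned_job_graphs(zk_job_ids, ha_blob_job_ids):
    """Return job IDs that have a job graph entry in ZooKeeper
    but no corresponding blobs in the HA directory.

    Both arguments are plain iterables of job ID strings; how they
    are collected is outside the scope of this sketch.
    """
    blobs = set(ha_blob_job_ids)
    return sorted(job_id for job_id in zk_job_ids if job_id not in blobs)


# Illustrative input only: the first ID is the one from this report,
# the second is a made-up ID standing in for a job that is still healthy.
orphans = find_orphaned_job_graphs(
    ["4e9a5a9d70ca99dbd394c35f8dfeda65", "00000000000000000000000000000001"],
    ["00000000000000000000000000000001"],
)
print(orphans)
```

Any ID this returns is an entry that a newly elected leader would try to recover 
and fail on.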

At this point the cluster has to be fully stopped, the ZooKeeper {{jobgraphs}} 
node cleared out by hand, and all the JobManagers restarted.
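
For reference, the manual cleanup can be sketched roughly as below. This is an 
assumption-laden illustration, not a supported procedure: the helper name 
{{clear_stale_job_graphs}} is hypothetical, the default path 
{{/flink/default/jobgraphs}} assumes the default HA root and cluster ID (ours 
may differ from yours), and {{client}} is any object exposing 
{{get_children(path)}} and {{delete(path, recursive=True)}}, which is the shape 
of the kazoo {{KazooClient}} API. It defaults to a dry run so nothing is deleted 
unless you ask for it.

```python
def clear_stale_job_graphs(client, stale_job_ids,
                           jobgraphs_path="/flink/default/jobgraphs",
                           dry_run=True):
    """Delete the given job graph nodes from ZooKeeper.

    `client` is assumed to behave like kazoo's KazooClient
    (get_children / delete). `jobgraphs_path` is an assumed default;
    check your own HA configuration before using it.
    Returns the list of node paths that were (or would be) removed.
    """
    existing = set(client.get_children(jobgraphs_path))
    removed = []
    for job_id in stale_job_ids:
        if job_id not in existing:
            continue  # nothing stored for this job, skip it
        node = f"{jobgraphs_path}/{job_id}"
        if not dry_run:
            # recursive=True in case the node has children of its own
            client.delete(node, recursive=True)
        removed.append(node)
    return removed
```

Run it with the cluster stopped, inspect the dry-run result first, and only then 
call it again with {{dry_run=False}}.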

I can see the following line in the JobManager logs:

{{2018-08-20 16:17:20,776 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.}}

But looking in ZooKeeper, the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job graph is 
still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
