[
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangyu Zhu resolved FLINK-10133.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.6.1
> finished job's jobgraph never cleaned up in zookeeper for standalone
> clusters (HA mode with multiple masters)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
> Issue Type: Bug
> Components: JobManager
> Affects Versions: 1.5.0, 1.5.2, 1.6.0
> Reporter: Xiangyu Zhu
> Priority: Major
> Fix For: 1.6.1
>
> Attachments: client.log, namenode.log, standalonesession.log,
> zookeeper.log
>
>
> Hi,
> We have 3 servers in our test environment, denoted node1-3. The setup is as
> follows (the HA-related config is sketched after the list):
> * Hadoop HDFS: node1 as namenode, node2, 3 as datanodes
> * ZooKeeper: node1-3 as a quorum (but also tried node1 alone)
> * Flink: node1, 2 as masters, node2, 3 as slaves
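> The HA-related part of our `conf/flink-conf.yaml` is essentially the
> standard ZooKeeper HA setup; the values below are illustrative, not
> verbatim:
>
>     high-availability: zookeeper
>     high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
>     high-availability.storageDir: hdfs:///flink/ha/
>
> plus node1 and node2 listed in `conf/masters`.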
> As I understand it, when a job finishes, the job's blob data is expected to
> be deleted from the HDFS path, and the node under ZooKeeper's path `/{zk
> path root}/{cluster-id}/jobgraphs/{job id}` should be deleted after that.
> However, we observe that whenever we submit a job and it finishes (via
> `bin/flink run WordCount.jar`), the blob data is gone, whereas the job id
> node under ZooKeeper is still there, with a UUID-style lock node inside it.
> In ZooKeeper's debug log we observed something like "cannot be deleted
> because non empty". Because of this, as long as a finished job's jobgraph
> node persists, restarting the cluster or killing one job manager (to test
> HA mode) makes Flink try to recover the finished job; it cannot find the
> blob data under HDFS, and the whole cluster goes down.
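> One way to see the leftover node is ZooKeeper's CLI. The paths below assume
> the default `/flink` root and `default` cluster-id, and `<job id>` is a
> placeholder for the finished job's id:
>
>     # from the ZooKeeper installation; connect to any quorum member
>     bin/zkCli.sh -server node1:2181
>     # at the zkCli prompt: a finished job's id still shows up here
>     ls /flink/default/jobgraphs
>     # the leftover entry still has a UUID-style lock child underneath
>     ls /flink/default/jobgraphs/<job id>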
> If we use only node1 as master and node2, 3 as slaves, the jobgraph node is
> deleted successfully. When the jobgraphs path is clean, killing one job
> manager makes another stand-by JM become leader, so it is only this
> jobgraph cleanup issue that prevents HA from working.
> I'm not sure whether something is wrong with our configs, because this
> happens every time a job finishes (we only tested with WordCount.jar,
> though). I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011
> this happens every time, rendering HA mode unusable for us.
> Any idea what might cause this?
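> (For completeness: the "cannot be deleted because non empty" message is
> consistent with plain ZooKeeper semantics; a znode that still has children,
> such as the lingering lock node, cannot be deleted. A quick demonstration
> at the zkCli prompt with throwaway paths:
>
>     create /demo data
>     create /demo/lock data
>     delete /demo          # fails: Node not empty: /demo
>     delete /demo/lock
>     delete /demo          # succeeds once the child is gone
>
> So whichever process leaves the lock node behind effectively blocks the
> jobgraph cleanup.)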
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)