[
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587293#comment-16587293
]
Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------
I'm just combing through the Zookeeper logs to see if there's anything that
might be helpful. I'm going to dump anything that looks a bit odd here:
{quote}
2018-08-21 10:27:05,657 [myid:160] - INFO [ProcessThread(sid:160
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when
processing sessionid:0x750000066362001c type:create cxid:0x6 zxid:0x2000000fe
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_one/leaderlatch/rest_server_lock Error:KeeperErrorCode =
NoNode for /flink/cluster_one/leaderlatch/rest_server_lock
2018-08-21 10:27:05,938 [myid:160] - INFO [ProcessThread(sid:160
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when
processing sessionid:0x750000066362001c type:create cxid:0x24 zxid:0x200000104
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_one/leaderlatch/resource_manager_lock Error:KeeperErrorCode
= NoNode for /flink/cluster_one/leaderlatch/resource_manager_lock
2018-08-21 10:27:05,944 [myid:160] - INFO [ProcessThread(sid:160
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when
processing sessionid:0x750000066362001c type:create cxid:0x29 zxid:0x200000105
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_one/leaderlatch/dispatcher_lock Error:KeeperErrorCode =
NoNode for /flink/cluster_one/leaderlatch/dispatcher_lock
{quote}
{quote}
2018-08-21 10:28:35,032 [myid:160] - INFO [ProcessThread(sid:160
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when
processing sessionid:0x750000066362001c type:create cxid:0xde zxid:0x200000145
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/0000000000000000061/ada19912-8c78-4f15-b1ef-f0acc5011559
Error:KeeperErrorCode = NodeExists for
/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/0000000000000000061/ada19912-8c78-4f15-b1ef-f0acc5011559
2018-08-21 10:28:35,184 [myid:160] - INFO [ProcessThread(sid:160
cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when
processing sessionid:0x750000066362001c type:create cxid:0xe2 zxid:0x200000146
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock
Error:KeeperErrorCode = NoNode for
/flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock
{quote}
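To make the failure mode in this issue concrete, here is a minimal sketch of the consistency check an operator could run (a hypothetical helper, not Flink code): given the job IDs registered under ZooKeeper's {{jobgraphs}} node and the job IDs that actually have blobs in the HA storage directory, the difference is the set of stale entries that a newly elected leader will try to recover and fail on.

```python
def find_stale_jobgraphs(zk_jobgraph_ids, ha_blob_job_ids):
    """Return job IDs registered under ZooKeeper's jobgraphs node that
    have no matching blobs in the HA storage directory. On failover,
    the new leader attempts to recover exactly these and crashes."""
    return sorted(set(zk_jobgraph_ids) - set(ha_blob_job_ids))


# Hypothetical state after a cancelled job left its graph behind:
zk_ids = ["4e9a5a9d70ca99dbd394c35f8dfeda65",
          "a5d03dfb348783950c006fe8d6e73fc5"]
ha_ids = ["a5d03dfb348783950c006fe8d6e73fc5"]
print(find_stale_jobgraphs(zk_ids, ha_ids))
# -> ['4e9a5a9d70ca99dbd394c35f8dfeda65']
```

The job IDs above are taken from the logs in this ticket; the helper itself is purely illustrative.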
> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Thomas Wozniakowski
> Priority: Blocker
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a
> savepoint), the JobGraphs are NOT removed from the ZooKeeper {{jobgraphs}}
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, and
> so on, you will end up with many JobGraphs stored in ZooKeeper but none of
> the corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those
> old JobGraph objects from ZooKeeper, then goes looking for the corresponding
> blobs in the HA directory. The blobs are not there, so the JobManager
> explodes and the process dies.
> At this point the cluster has to be fully stopped, the ZooKeeper
> {{jobgraphs}} node cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore -
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is
> still very much there.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)