[
https://issues.apache.org/jira/browse/FLINK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916844#comment-16916844
]
Thomas Weise edited comment on FLINK-10030 at 8/27/19 4:29 PM:
---------------------------------------------------------------
We have also observed this problem across Flink 1.4.x, 1.5.x and still on 1.8.
In our setup we have a single job manager (no standby). Job cancellation can
leave behind the ZK entry, while the file was already removed from
high-availability.storageDir. This leads to a situation where the job manager
fails to start or complete leader election because it cannot recover the job
(that was supposed to be cancelled). It appears that the ZK entry should be
removed before anything else so that there is no possibility that an attempt is
made to recover a cancelled job.
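To illustrate why the ordering matters, here is a minimal in-memory sketch. It is not Flink code: the class, its fields, and the "/ha/" path are hypothetical stand-ins for the ZK jobgraphs entry and the file under high-availability.storageDir. It only models a failure between the two cleanup steps, assuming recovery fails whenever a ZK entry points at a missing file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these names are Flink or ZooKeeper APIs. */
class CancelOrderSketch {
    /** Stands in for the ZK jobgraphs node: job id -> path of the stored job graph. */
    final Map<String, String> zkEntries = new HashMap<>();
    /** Stands in for the files under high-availability.storageDir. */
    final Set<String> storageFiles = new HashSet<>();

    void submit(String jobId) {
        String path = "/ha/" + jobId; // hypothetical path layout
        storageFiles.add(path);
        zkEntries.put(jobId, path);
    }

    /** Removes the ZK entry first, then the file. */
    void cancelPointerFirst(String jobId, boolean crashBetweenSteps) {
        String path = zkEntries.remove(jobId);
        if (path == null || crashBetweenSteps) {
            return; // worst case: an orphaned file that nothing refers to
        }
        storageFiles.remove(path);
    }

    /** Removes the file first, then the ZK entry. */
    void cancelFileFirst(String jobId, boolean crashBetweenSteps) {
        String path = zkEntries.get(jobId);
        if (path == null) {
            return;
        }
        storageFiles.remove(path);
        if (crashBetweenSteps) {
            return; // worst case: a ZK entry pointing at a deleted file
        }
        zkEntries.remove(jobId);
    }

    /** Models restart: recovery works only if every ZK entry still has its file. */
    boolean canRecover() {
        return storageFiles.containsAll(zkEntries.values());
    }

    public static void main(String[] args) {
        CancelOrderSketch pointerFirst = new CancelOrderSketch();
        pointerFirst.submit("job-1");
        pointerFirst.cancelPointerFirst("job-1", true);
        System.out.println("pointer first, restart ok: " + pointerFirst.canRecover()); // true

        CancelOrderSketch fileFirst = new CancelOrderSketch();
        fileFirst.submit("job-1");
        fileFirst.cancelFileFirst("job-1", true);
        System.out.println("file first, restart ok: " + fileFirst.canRecover()); // false
    }
}
```

In this model an interrupted pointer-first cancellation leaves at most an orphaned file to clean up later, while an interrupted file-first cancellation leaves an entry that blocks every subsequent restart.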
> zookeeper jobgraphs job info cannot be removed when the job is cancelled with
> zk ha mode
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-10030
> URL: https://issues.apache.org/jira/browse/FLINK-10030
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.5.0
> Reporter: qiang.li
> Priority: Major
>
> With Flink 1.5 in ZooKeeper HA mode, if a job is cancelled and the cluster is
> then restarted, the jobmanager fails because the blob data is missing. I found
> that the job's entry under the ZK jobgraphs node cannot be removed because the
> standby jobmanager holds a lock on the node. I think the standby jobmanager
> should not watch the jobgraphs node.