[jira] [Commented] (FLINK-10030) zookeeper jobgraphs job info cannot be removed when the job is cancelled with zk ha mode

Thomas Weise (Jira) Tue, 27 Aug 2019 09:27:28 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916844#comment-16916844
 ]


Thomas Weise commented on FLINK-10030:
--------------------------------------

We have also observed this problem on Flink 1.4.x and 1.5.x, and it still 
occurs on 1.8. Job cancellation can leave the ZK entry behind even though the 
file has already been removed from high-availability.storageDir. As a result, 
the job manager fails to start or to complete leader election because it 
cannot recover the job (which was supposed to be cancelled). It appears that 
the ZK entry should be removed before anything else, so that no attempt can be 
made to recover a cancelled job.
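
To illustrate why the removal order matters, here is a minimal sketch in 
Python. It is a toy in-memory model, not Flink code: HaState, 
cancel_file_first, cancel_zk_first and the fail_between flag are hypothetical 
names made up for this example, and the interruption between the two cleanup 
steps is an assumption (the actual cause may differ, for example the lock 
described in the issue below).

```python
# Toy in-memory model, not Flink code. All names are hypothetical.


class HaState:
    """Stands in for the ZK jobgraphs node and high-availability.storageDir."""

    def __init__(self):
        self.zk_entries = set()    # job ids registered under the ZK jobgraphs node
        self.stored_files = set()  # job ids whose file exists in storageDir

    def submit(self, job_id):
        # In this model the file is written before the ZK entry that refers to it.
        self.stored_files.add(job_id)
        self.zk_entries.add(job_id)

    def recoverable(self):
        """True if every job listed in ZK still has its file in storageDir."""
        return all(job_id in self.stored_files for job_id in self.zk_entries)


def cancel_file_first(state, job_id, fail_between=False):
    """Order matching the observed failure: file removed, ZK entry left behind."""
    state.stored_files.discard(job_id)
    if fail_between:
        return  # simulated interruption before the ZK entry is deleted
    state.zk_entries.discard(job_id)


def cancel_zk_first(state, job_id, fail_between=False):
    """Order suggested above: remove the ZK entry before anything else."""
    state.zk_entries.discard(job_id)
    if fail_between:
        return  # simulated interruption; at worst an orphaned file remains
    state.stored_files.discard(job_id)


def demo():
    """Interrupt both orders midway and report whether recovery can succeed."""
    file_first = HaState()
    file_first.submit("job-1")
    cancel_file_first(file_first, "job-1", fail_between=True)

    zk_first = HaState()
    zk_first.submit("job-1")
    cancel_zk_first(zk_first, "job-1", fail_between=True)

    return file_first.recoverable(), zk_first.recoverable()
```

In this model, interrupting the file-first order leaves a ZK entry that points 
at a missing file, so recovery cannot succeed. Interrupting the ZK-first order 
leaves at most an orphaned file in storageDir, which does not block startup.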

> zookeeper jobgraphs job info cannot be removed when the job is cancelled with 
> zk ha mode
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-10030
>                 URL: https://issues.apache.org/jira/browse/FLINK-10030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.0
>            Reporter: qiang.li
>            Priority: Major
>
> With Flink 1.5 in ZK HA mode, if a job is cancelled and the cluster is then 
> restarted, the jobmanager fails because the blob data is missing. I found 
> that the job's information under the ZK jobgraphs node cannot be removed 
> because the standby jobmanager holds a lock on the node. I think the standby 
> jobmanager should not watch the jobgraphs node.



