[jira] [Comment Edited] (FLINK-10030) zookeeper jobgraphs job info cannot be removed when the job is cancelled with zk ha mode

Thomas Weise (Jira) Tue, 27 Aug 2019 09:30:09 -0700


    [ https://issues.apache.org/jira/browse/FLINK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916844#comment-16916844 ]


Thomas Weise edited comment on FLINK-10030 at 8/27/19 4:29 PM:
---------------------------------------------------------------

We have also observed this problem across Flink 1.4.x, 1.5.x and still on 1.8. 
In our setup we have a single job manager (no standby). Job cancellation can 
leave behind the ZK entry, while the file was already removed from 
high-availability.storageDir. This leads to a situation where the job manager 
fails to start or complete leader election because it cannot recover the job 
(that was supposed to be cancelled). It appears that the ZK entry should be 
removed before anything else so that there is no possibility that an attempt is 
made to recover a cancelled job.
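
The ordering argument in the comment above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink code: a dict stands in for the ZK job entries, a set stands in for the files in high-availability.storageDir, and every name (cancel_job, recover_jobs, simulate, CleanupInterrupted) is hypothetical. It interrupts cleanup between the two removal steps and then attempts recovery.

```python
class CleanupInterrupted(Exception):
    """Simulates the job manager dying partway through cleanup."""


def cancel_job(job_id, zk_entries, storage_files, zk_first, fail_between_steps=False):
    """Remove a cancelled job's state from both stores (toy model).

    zk_entries:    dict standing in for the ZK job entries (job id -> file name)
    storage_files: set standing in for files in high-availability.storageDir
    """
    file_name = zk_entries[job_id]
    if zk_first:
        # Ordering suggested in the comment: drop the ZK entry first.
        del zk_entries[job_id]
        if fail_between_steps:
            raise CleanupInterrupted(job_id)
        storage_files.discard(file_name)
    else:
        # Ordering that matches the observed failure: file is removed first.
        storage_files.discard(file_name)
        if fail_between_steps:
            raise CleanupInterrupted(job_id)
        del zk_entries[job_id]


def recover_jobs(zk_entries, storage_files):
    """On startup, try to recover every job that still has a ZK entry."""
    recovered = []
    for job_id, file_name in zk_entries.items():
        if file_name not in storage_files:
            raise FileNotFoundError(f"job {job_id}: {file_name} missing from storage")
        recovered.append(job_id)
    return recovered


def simulate(zk_first):
    """Interrupt cleanup between the two steps, then attempt a restart."""
    zk = {"job-1": "graph-1"}
    files = {"graph-1"}
    try:
        cancel_job("job-1", zk, files, zk_first, fail_between_steps=True)
    except CleanupInterrupted:
        pass
    try:
        return recover_jobs(zk, files)
    except FileNotFoundError:
        return "startup failed"
```

In this model, simulate(zk_first=True) returns an empty list because nothing is left to recover, and the worst case is an orphaned file in storage. simulate(zk_first=False) returns "startup failed", which mirrors the stale ZK entry pointing at a file that no longer exists.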


was (Author: thw):
We have also observed this problem across Flink 1.4.x, 1.5.x and still on 1.8. 
Job cancellation can leave behind the ZK entry, while the file was already 
removed from high-availability.storageDir. This leads to a situation where the 
job manager fails to start or complete leader election because it cannot 
recover the job (that was supposed to be cancelled). It appears that the ZK 
entry should be removed before anything else so that there is no possibility 
that an attempt is made to recover a cancelled job.

> zookeeper jobgraphs job info cannot be removed when the job is cancelled with 
> zk ha mode
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-10030
>                 URL: https://issues.apache.org/jira/browse/FLINK-10030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.0
>            Reporter: qiang.li
>            Priority: Major
>
> With Flink 1.5 in ZK HA mode, if a job is cancelled and the cluster is then 
> restarted, the jobmanager fails because the blob data is missing. I found 
> that the job's information under the ZK jobgraphs node cannot be removed 
> because the standby jobmanager holds a lock on the node. I think the standby 
> jobmanager should not watch the jobgraphs node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
