[ 
https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455955#comment-17455955
 ] 

Samuel Lacroix commented on FLINK-20195:
----------------------------------------

[~trohrmann] We're on 1.13.2.

 

More details: We have ~30 JMs, with one job per JM. The duplicated jobs (same id, 
with one of them "suspending") appear at the same time as the ZK connection 
losses, which trigger re-election of the same JM, so we believe it is 
exactly this issue. It happens with relatively low probability (just a few 
JMs at a time), but the duplicates pile up if we don't clean them.

*That's the first issue (job duplication in the UI)*
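
For anyone who wants to watch for this, here is a minimal sketch of a check that could be run against the body returned by GET /jobs. It only assumes the response shape shown in the issue description ({"jobs":[{"id":...,"status":...}]}); the function name duplicated_job_ids is made up for illustration, and fetching the response from the REST endpoint is left out.

```python
import json
from collections import Counter
from typing import List


def duplicated_job_ids(jobs_response: str) -> List[str]:
    """Return the job IDs that appear more than once in a GET /jobs body.

    Assumes the body looks like {"jobs": [{"id": "...", "status": "..."}]},
    as in the responses quoted in the issue description.
    """
    jobs = json.loads(jobs_response)["jobs"]
    # Count how many entries share each job ID, regardless of status.
    counts = Counter(job["id"] for job in jobs)
    return sorted(job_id for job_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Response taken from the issue description (the line with the duplicate).
    body = (
        '{"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},'
        '{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},'
        '{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}'
    )
    print(duplicated_job_ids(body))
```

Counting by id only (and ignoring status) is deliberate: in our case the two entries have different statuses, so comparing whole entries would miss them.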

 

After some time, if we don't clean them (we clean them by stopping and 
redeploying the job, btw), some JMs fail with this:
org.apache.flink.util.FlinkException: Could not retrieve checkpoint XXXXXX from 
state handle under /0000000000000XXXXXX. This indicates that the retrieved 
state handle is broken. Try cleaning the state handle store.
Caused by: java.io.FileNotFoundException: File does not exist: 
/XXX/YYY/completedCheckpoint94b37b00554d

*That's the second issue.* This ZK key should not exist; it lives alongside 
another one, like this:

/default/<jobid>/0000000000000XXXXXX  <= "ghost" key, referencing a 
non-existent checkpoint

/default/<jobid>/0000000000000YYYYYY <= good key, referencing the last 
checkpoint of the running job

So it seems the job duplication has an impact on ZK, preventing the JMs from 
restarting. It can be fixed (temporarily) by removing the "ghost" key.
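
A rough sketch of how the "ghost" keys could be spotted before a JM trips over them. This is not Flink code: it assumes we have already resolved, for each checkpoint key under /default/<jobid>, the file path its state handle points to (that step requires reading the state handle and is not shown), and that we have a listing of the checkpoint files that actually exist. The name find_ghost_keys and the sample keys and paths below are hypothetical.

```python
from typing import Dict, Iterable, List


def find_ghost_keys(
    checkpoint_paths: Dict[str, str], existing_files: Iterable[str]
) -> List[str]:
    """Return the ZK keys whose referenced checkpoint file is missing.

    checkpoint_paths: ZK key -> file path the key's state handle points to
                      (assumed to be resolved elsewhere, not shown here).
    existing_files:   paths that really exist on the checkpoint storage.
    """
    existing = set(existing_files)
    return sorted(key for key, path in checkpoint_paths.items() if path not in existing)


if __name__ == "__main__":
    # Hypothetical keys and paths, for illustration only.
    keys = {
        "/default/job-a/ckpt-old": "/store/completedCheckpoint-old",
        "/default/job-a/ckpt-latest": "/store/completedCheckpoint-latest",
    }
    files = ["/store/completedCheckpoint-latest"]
    print(find_ghost_keys(keys, files))
```

The sketch only reports candidates; removing a key is still the manual (and temporary) fix described above.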

> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
>                 Key: FLINK-20195
>                 URL: https://issues.apache.org/jira/browse/FLINK-20195
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.11.2
>            Reporter: Ingo Bürk
>            Priority: Critical
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after 
> it has been cancelled. This occurred in Ververica Platform after canceling a 
> job (using PATCH /jobs/{jobId}) and calling GET /jobs.
> I've reproduced this and queried the endpoint in a relatively tight loop (~ 
> every 0.5s) to log the responses of GET /jobs and got this:
>  
>  
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
>  
> You can see in between that, for just a moment, the endpoint returned the 
> same Job ID twice.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
