[ https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455955#comment-17455955 ]

Samuel Lacroix edited comment on FLINK-20195 at 12/8/21, 7:44 PM:
------------------------------------------------------------------

[~trohrmann] We're on 1.13.2.

 

More details: We have ~30 JMs, with 1 job per JM. The duplicated jobs (same ID, 
with one of them "suspending") appear at the same time as ZK "lost 
connection" events, which trigger re-elections of the same JMs, so we believe 
this is exactly that issue. It happens with relatively low probability (just a 
few JMs at a time), but it can pile up if we don't clean them up.

*That's the first issue (job duplication in the UI)*
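For context, a quick way to spot these duplicates is to count the job IDs returned by GET /jobs. A minimal sketch (the endpoint URL and the helper name are illustrative, not part of Flink itself):

```python
import json
from collections import Counter
from urllib.request import urlopen

def find_duplicate_job_ids(jobs_response: dict) -> list:
    """Return the job IDs that appear more than once in a GET /jobs payload."""
    counts = Counter(job["id"] for job in jobs_response["jobs"])
    return sorted(jid for jid, n in counts.items() if n > 1)

# Hypothetical usage against a JobManager's REST endpoint:
# payload = json.load(urlopen("http://localhost:8081/jobs"))
# if find_duplicate_job_ids(payload):
#     print("duplicate job entries detected")
```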

 

After some time, if we don't clean them up (we clean them up by stopping and 
redeploying the job, btw), some JMs fail with this:
org.apache.flink.util.FlinkException: Could not retrieve checkpoint XXXXXX from 
state handle under /0000000000000XXXXXX. This indicates that the retrieved 
state handle is broken. Try cleaning the state handle store.
Caused by: java.io.FileNotFoundException: File does not exist: 
/XXX/YYY/completedCheckpoint94b37b00554d
*That's the second issue.* This ZK key should not exist; it lives alongside 
another one, like this:

/default/<jobid>/0000000000000XXXXXX  <= "ghost" key, referencing a 
non-existent checkpoint

/default/<jobid>/0000000000000YYYYYY <= good key, referencing the last 
checkpoint of the running job

So it seems the job duplication has an impact on ZK (and not only on the REST 
API/UI), preventing the JMs from restarting. We still haven't understood why. It 
can be fixed (temporarily) by removing the "ghost" key: the JM then restarts 
successfully and the job is restored properly.



> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
>                 Key: FLINK-20195
>                 URL: https://issues.apache.org/jira/browse/FLINK-20195
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.11.2
>            Reporter: Ingo Bürk
>            Priority: Critical
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after 
> it has been cancelled. This occurred in Ververica Platform after canceling a 
> job (using PATCH /jobs/\{jobId}) and calling GET /jobs.
> I've reproduced this and queried the endpoint in a relatively tight loop (~ 
> every 0.5s) to log the responses of GET /jobs and got this:
>  
>  
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
>  
> You can see that, in between, for just a moment, the endpoint returned the 
> same Job ID twice.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
