[
https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455955#comment-17455955
]
Samuel Lacroix commented on FLINK-20195:
----------------------------------------
[~trohrmann] We're on 1.13.2.
More details: We have ~30 JMs, with one job per JM. The duplicated jobs (same id,
with one of them "suspending") appear at the same time as ZK connection losses,
which trigger re-election of the same JM, so we believe it is exactly this
issue. It happens with relatively low probability (just a few JMs at a time),
but it can pile up if we don't clean them up.
*That's the first issue (job duplication in the UI)*
After some time, if we don't clean them up (we clean them up by stopping and
redeploying the job, btw), some JMs fail with this:
org.apache.flink.util.FlinkException: Could not retrieve checkpoint XXXXXX from
state handle under /0000000000000XXXXXX. This indicates that the retrieved
state handle is broken. Try cleaning the state handle store.
Caused by: java.io.FileNotFoundException: File does not exist:
/XXX/YYY/completedCheckpoint94b37b00554d
*That's the second issue.* This ZK key should not exist; it lives alongside
another one, like this:
/default/<jobid>/0000000000000XXXXXX <= "ghost" key, referencing a
non-existent checkpoint
/default/<jobid>/0000000000000YYYYYY <= good key, referencing the last
checkpoint of the running job
So it seems the job duplication has an impact on ZK, preventing the JMs from
restarting. It can be fixed (temporarily) by removing the "ghost" key.
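For anyone hitting the same thing, a minimal sketch of that temporary cleanup, assuming Apache Curator and a checkpoint path like /flink/default/<jobid>/checkpoints (the exact layout depends on the high-availability.zookeeper.path.* settings of the cluster); the same can be done by hand with zkCli.sh and its delete command:
{code:java}
// Minimal sketch of the temporary cleanup, NOT an official procedure.
// Assumptions: Apache Curator on the classpath, and the checkpoint nodes living
// under a path like /flink/default/<jobid>/checkpoints.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RemoveGhostCheckpointNode {
    public static void main(String[] args) throws Exception {
        String zkQuorum = args[0];   // e.g. "zk1:2181,zk2:2181,zk3:2181"
        String ghostPath = args[1];  // e.g. "/flink/default/<jobid>/checkpoints/0000000000000XXXXXX"

        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkQuorum, new ExponentialBackoffRetry(1000, 3))) {
            client.start();
            client.blockUntilConnected();

            // List the siblings first so the good key can be double-checked before deleting anything.
            String parent = ghostPath.substring(0, ghostPath.lastIndexOf('/'));
            System.out.println("children of " + parent + ": " + client.getChildren().forPath(parent));

            // Remove only the stale entry; the remaining node still points at the last valid checkpoint.
            client.delete().forPath(ghostPath);
        }
    }
}
{code}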
> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
> Key: FLINK-20195
> URL: https://issues.apache.org/jira/browse/FLINK-20195
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / REST
> Affects Versions: 1.11.2
> Reporter: Ingo Bürk
> Priority: Critical
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after
> it has been cancelled. This occurred in Ververica Platform after canceling a
> job (using PATCH /jobs/\{jobId}) and calling GET /jobs.
> I've reproduced this and queried the endpoint in a relatively tight loop (~
> every 0.5s) to log the responses of GET /jobs and got this:
>
>
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
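>
> A minimal polling sketch along those lines (a hypothetical helper, not the script used above; it assumes the REST endpoint at http://localhost:8081 and simply flags any response that contains the same job ID more than once):
>
> {code:java}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
> import java.util.HashSet;
> import java.util.Set;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class JobsPoller {
>     public static void main(String[] args) throws Exception {
>         HttpClient http = HttpClient.newHttpClient();
>         HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8081/jobs")).GET().build();
>         Pattern idPattern = Pattern.compile("\"id\":\"([0-9a-f]{32})\"");
>         while (true) {
>             String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
>             System.out.println(body);
>             // Flag responses where the same job ID appears more than once.
>             Set<String> seen = new HashSet<>();
>             Matcher m = idPattern.matcher(body);
>             while (m.find()) {
>                 if (!seen.add(m.group(1))) {
>                     System.out.println("DUPLICATE job id in response: " + m.group(1));
>                 }
>             }
>             Thread.sleep(500); // poll roughly every 0.5s, as in the log above
>         }
>     }
> }
> {code}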
>
> You can see that, for just a moment in between, the endpoint returned the
> same Job ID twice.
>