[
https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412603#comment-17412603
]
Chesnay Schepler edited comment on FLINK-20195 at 9/9/21, 2:14 PM:
-------------------------------------------------------------------
When a job is suspended (or terminates in any other way), information about
the job is stored in the Dispatcher's executionGraphStore.
This leads to duplication if the same JM is re-elected as the leader; it
restarts the job and then we have one entry for the suspended job in the
executionGraphStore, and an active JobMaster for the running job.
If another JM is elected then this issue is not visible because the
executionGraphStore is not persisted across Dispatchers.
I'm not sure there is a clean solution to the problem that doesn't require a
lot of work. We do want suspended jobs to be added to the store so that they
are still accessible while the job is suspended. Removing that entry when the
job restarts makes some sense and would fix the issue.
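
To illustrate the mechanism and the suggested fix, here is a minimal sketch.
It is not Flink's actual Dispatcher code: DispatcherSketch, its two maps, the
removeArchivedEntryOnRestart flag, and all method names are hypothetical
stand-ins, and it assumes the job listing is simply the running jobs followed
by the entries in the store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the Dispatcher's bookkeeping; not Flink's real classes.
class DispatcherSketch {
    // Archived (suspended or terminated) jobs: job ID -> last known status.
    private final Map<String, String> executionGraphStore = new LinkedHashMap<>();
    // Jobs with an active JobMaster: job ID -> current status.
    private final Map<String, String> runningJobs = new LinkedHashMap<>();
    // Assumed switch for the proposed fix.
    private final boolean removeArchivedEntryOnRestart;

    DispatcherSketch(boolean removeArchivedEntryOnRestart) {
        this.removeArchivedEntryOnRestart = removeArchivedEntryOnRestart;
    }

    // Called when this Dispatcher (re)gains leadership and starts the job.
    void startJob(String jobId) {
        if (removeArchivedEntryOnRestart) {
            // Proposed fix: drop the stale entry left behind by the suspension.
            executionGraphStore.remove(jobId);
        }
        runningJobs.put(jobId, "RUNNING");
    }

    // Called when the job is suspended, e.g. on loss of leadership.
    void suspendJob(String jobId) {
        runningJobs.remove(jobId);
        executionGraphStore.put(jobId, "SUSPENDED");
    }

    // Assumed listing logic: running jobs first, then archived jobs.
    List<String> listJobIds() {
        List<String> ids = new ArrayList<>(runningJobs.keySet());
        ids.addAll(executionGraphStore.keySet());
        return ids;
    }
}
```

With the flag off, calling startJob, suspendJob, and startJob again for the
same ID makes listJobIds() return that ID twice; with the flag on it is
returned once. A different Dispatcher instance would start with an empty
store, which matches the observation that the issue is not visible when
another JM is elected.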
> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
> Key: FLINK-20195
> URL: https://issues.apache.org/jira/browse/FLINK-20195
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / REST
> Affects Versions: 1.11.2
> Reporter: Ingo Bürk
> Priority: Minor
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after
> it has been canceled. This occurred in Ververica Platform after canceling a
> job (using PATCH /jobs/{jobId}) and calling GET /jobs.
> I reproduced this by querying the endpoint in a relatively tight loop
> (roughly every 0.5s) and logging the responses of GET /jobs, and got this:
>
>
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
>
> You can see that, for just a moment in between, the endpoint returned the
> same job ID twice.
>