[jira] [Comment Edited] (FLINK-20195) Jobs endpoint returns duplicated jobs

Chesnay Schepler (Jira) Thu, 09 Sep 2021 07:15:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412603#comment-17412603
 ]


Chesnay Schepler edited comment on FLINK-20195 at 9/9/21, 2:14 PM:
-------------------------------------------------------------------

When a job is suspended (or terminates in any other way), information about 
the job is stored in the Dispatcher's executionGraphStore.

This leads to duplication if the same JM is re-elected as the leader; it 
restarts the job and then we have one entry for the suspended job in the 
executionGraphStore, and an active JobMaster for the running job.

If another JM is elected then this issue is not visible because the 
executionGraphStore is not persisted across Dispatchers.


I'm not sure if there is a clean solution to the problem (that doesn't require 
a lot of work). We do want suspended jobs to be added to the store so that they 
are still accessible while the job is suspended. Removing that entry when the 
job restarts makes some sense and would fix the issue.
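
To make the mechanism concrete, here is a minimal sketch of the duplication and of the "remove the entry on restart" idea. It is a toy model, not Flink code: the class JobListingSketch, its methods, and the evictOnRestart flag are all hypothetical names invented for illustration, and the two maps merely stand in for the executionGraphStore and the set of active JobMasters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; none of these names are actual Flink classes or methods.
final class JobListingSketch {
    // Stands in for the Dispatcher's executionGraphStore (suspended/terminated jobs).
    private final Map<String, String> archivedJobs = new LinkedHashMap<>();
    // Stands in for jobs that currently have an active JobMaster.
    private final Map<String, String> runningJobs = new LinkedHashMap<>();
    // Assumption: when true, the stale archived entry is dropped as the job (re)starts.
    private final boolean evictOnRestart;

    JobListingSketch(boolean evictOnRestart) {
        this.evictOnRestart = evictOnRestart;
    }

    void start(String jobId) {
        if (evictOnRestart) {
            archivedJobs.remove(jobId);
        }
        runningJobs.put(jobId, "RUNNING");
    }

    void suspend(String jobId) {
        runningJobs.remove(jobId);
        // Keep the job visible while it is suspended.
        archivedJobs.put(jobId, "SUSPENDED");
    }

    // Mimics a jobs overview that concatenates both sources without deduplicating.
    List<String> listJobs() {
        List<String> result = new ArrayList<>();
        archivedJobs.forEach((id, status) -> result.add(id + ":" + status));
        runningJobs.forEach((id, status) -> result.add(id + ":" + status));
        return result;
    }

    public static void main(String[] args) {
        for (boolean evict : new boolean[] {false, true}) {
            JobListingSketch dispatcher = new JobListingSketch(evict);
            dispatcher.start("job-1");
            dispatcher.suspend("job-1"); // leadership lost
            dispatcher.start("job-1");   // same JM re-elected, job restarted
            System.out.println("evictOnRestart=" + evict + " -> " + dispatcher.listJobs());
        }
    }
}
```

Without eviction the listing contains job-1 twice (once as SUSPENDED, once as RUNNING); with eviction only the RUNNING entry remains. The suspended job is still listed between suspend and restart in both cases, which is the property we want to keep.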



> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
>                 Key: FLINK-20195
>                 URL: https://issues.apache.org/jira/browse/FLINK-20195
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.11.2
>            Reporter: Ingo Bürk
>            Priority: Minor
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after 
> it has been cancelled. This occurred in Ververica Platform after canceling a 
> job (using PATCH /jobs/{jobId}) and calling GET /jobs.
> I've reproduced this and queried the endpoint in a relatively tight loop (~ 
> every 0.5s) to log the responses of GET /jobs and got this:
>  
>  
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
>  
> You can see that, for just a moment in between, the endpoint returned the 
> same job ID twice.
>  


