[
https://issues.apache.org/jira/browse/FLINK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557513#comment-17557513
]
Aitozi edited comment on FLINK-28187 at 6/22/22 3:19 PM:
---------------------------------------------------------
I'm afraid I did not express my meaning clearly. Let me give an example of
what I have in mind:
1. Submit the job with {{Generation1}}; the generated JobID is
{{ns/name@Generation1}}.
2. The submission times out but actually succeeds, and the last reconciled
spec is not updated.
3. The user changes the spec and the generation becomes {{Generation2}}
(before the observer has synced the job status and updated the last
reconciled spec).
4. The observer looks for the job with JobID {{ns/name@Generation2}}, which
does not match the first job.
5. The reconciler submits the job with {{Generation2}}.
In this sequence, the job {{ns/name@Generation1}} will be orphaned.
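The sequence above can be replayed with a small toy model. This is only a sketch, not operator code: the class {{OrphanedJobSketch}}, the helpers {{jobId}} and {{replay}}, and the in-memory set standing in for the session cluster are all hypothetical, and the ID format simply follows the {{ns/name@Generation}} notation used in the steps.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy replay of the five steps; every name here is hypothetical. */
public class OrphanedJobSketch {

    /** Assumed deterministic JobID scheme: namespace/name@generation. */
    static String jobId(String namespace, String name, long generation) {
        return namespace + "/" + name + "@" + generation;
    }

    /** Replays steps 1-5 and returns the IDs of the jobs left on the cluster. */
    static Set<String> replay() {
        // Stand-in for the jobs actually running on the session cluster.
        Set<String> cluster = new TreeSet<>();

        // Steps 1-2: the submission for generation 1 times out on the operator
        // side but succeeds on the cluster; nothing is recorded as reconciled.
        cluster.add(jobId("ns", "name", 1));

        // Step 3: the user edits the spec before the observer has synced.
        long currentGeneration = 2;

        // Step 4: the observer looks up the ID derived from the current
        // generation, so it cannot see the job submitted for generation 1.
        boolean found = cluster.contains(jobId("ns", "name", currentGeneration));

        // Step 5: nothing was found, so the reconciler submits again.
        if (!found) {
            cluster.add(jobId("ns", "name", currentGeneration));
        }
        return cluster;
    }

    public static void main(String[] args) {
        // Both jobs are running; only ns/name@2 is still tracked.
        System.out.println(replay()); // prints [ns/name@1, ns/name@2]
    }
}
```

In this model the lookup key changes with the generation, so the job submitted for generation 1 is never found again and nothing ever cancels it.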
> Duplicate job submission for FlinkSessionJob
> --------------------------------------------
>
> Key: FLINK-28187
> URL: https://issues.apache.org/jira/browse/FLINK-28187
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.0.0
> Reporter: Jeesmon Jacob
> Priority: Critical
> Attachments: flink-operator-log.txt
>
>
> During a session job submission, if a deployment error (e.g.
> concurrent.TimeoutException) is hit, the operator will submit the job again.
> But the first submission could have succeeded on the JobManager side, and the
> second submission could result in a duplicate job. Operator log attached.
> Per [~gyfora]:
> The problem is that in case a deployment error was hit, the
> SessionJobObserver will not be able to tell whether it has submitted the job
> or not. So it will simply try to submit it again. We have to find a mechanism
> to correlate Jobs on the cluster with the SessionJob CR itself. Maybe we
> could override the job name itself for this purpose or something like that.