[jira] [Comment Edited] (FLINK-28187) Duplicate job submission for FlinkSessionJob

Aitozi (Jira) Wed, 22 Jun 2022 04:27:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557379#comment-17557379
 ]


Aitozi edited comment on FLINK-28187 at 6/22/22 11:26 AM:
----------------------------------------------------------

Currently, the operator generates the JobID in advance to help deduplicate 
the job submission 
[link|https://github.com/apache/flink-kubernetes-operator/blob/91753ec5cef1aef85ff3884197e75fa25f7f6625/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java#L215].
If we submit a job with the same JobID again, it will throw 
DuplicateJobSubmissionException. I think the problem is that if the job 
submission fails, the operator does not store the reconcile spec, so the JobID 
is not stored either, and a new one is generated for the next submission.
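
To illustrate the failure mode described in the comment above, here is a minimal, self-contained sketch. It is not the operator's code: FakeCluster, FakeStatus, DuplicateJobException, reconcile and the storeIdBeforeSubmit flag are all hypothetical stand-ins, and the assumption is simply that the cluster rejects a JobID it has already accepted. It compares a retry that lost the JobID with a retry that still has it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class JobIdDedupSketch {

    /** Hypothetical stand-in for Flink's DuplicateJobSubmissionException. */
    static class DuplicateJobException extends RuntimeException {
        DuplicateJobException(String jobId) {
            super("Job already submitted: " + jobId);
        }
    }

    /** Hypothetical session cluster that rejects a JobID it has already accepted. */
    static class FakeCluster {
        final Set<String> jobs = new HashSet<>();
        boolean failNextAck = false;

        void submit(String jobId) {
            if (!jobs.add(jobId)) {
                throw new DuplicateJobException(jobId);
            }
            if (failNextAck) {
                // The job was accepted, but the caller still sees an error (e.g. a timeout).
                failNextAck = false;
                throw new IllegalStateException("simulated timeout after acceptance");
            }
        }
    }

    /** Hypothetical stand-in for the state persisted for a session job. */
    static class FakeStatus {
        String jobId;
    }

    static void reconcile(FakeCluster cluster, FakeStatus status, boolean storeIdBeforeSubmit) {
        // Reuse the stored JobID if there is one, otherwise generate a fresh one.
        String jobId = status.jobId != null ? status.jobId : UUID.randomUUID().toString();
        if (storeIdBeforeSubmit) {
            status.jobId = jobId;
        }
        try {
            cluster.submit(jobId);
        } catch (DuplicateJobException e) {
            // An earlier attempt already reached the cluster; nothing more to submit.
        }
        // Reached only when submit did not fail with a deployment error.
        status.jobId = jobId;
    }

    /** Runs one failed attempt plus one retry; returns how many jobs the cluster holds. */
    static int jobsAfterRetry(boolean storeIdBeforeSubmit) {
        FakeCluster cluster = new FakeCluster();
        FakeStatus status = new FakeStatus();
        cluster.failNextAck = true;
        try {
            reconcile(cluster, status, storeIdBeforeSubmit);
        } catch (IllegalStateException e) {
            // First attempt: the caller only sees a deployment error.
        }
        reconcile(cluster, status, storeIdBeforeSubmit);
        return cluster.jobs.size();
    }

    public static void main(String[] args) {
        System.out.println("JobID lost on failure: " + jobsAfterRetry(false) + " job(s)");
        System.out.println("JobID stored up front: " + jobsAfterRetry(true) + " job(s)");
    }
}
```

In this sketch the retry with a regenerated JobID leaves two jobs on the fake cluster, while the retry that reuses the stored JobID is rejected as a duplicate and leaves one. Whether storing the JobID before submission is the right fix for the operator is a separate question; the sketch only shows why losing it defeats the deduplication.
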


was (Author: aitozi):
Yes, I generate the JobID in advance to help deduplicate the job submission 
[link|https://github.com/apache/flink-kubernetes-operator/blob/91753ec5cef1aef85ff3884197e75fa25f7f6625/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java#L215]
 
But I think the problem is that if the job submission fails, it will not store 
the reconcile spec, so the JobID is not stored.

> Duplicate job submission for FlinkSessionJob
> --------------------------------------------
>
>                 Key: FLINK-28187
>                 URL: https://issues.apache.org/jira/browse/FLINK-28187
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.0.0
>            Reporter: Jeesmon Jacob
>            Priority: Critical
>         Attachments: flink-operator-log.txt
>
>
> During a session job submission, if a deployment error (e.g. 
> concurrent.TimeoutException) is hit, the operator will submit the job again. 
> But the first submission could have succeeded on the JobManager side, and the 
> second submission could result in a duplicate job. Operator log attached.
> Per [~gyfora]:
> The problem is that in case a deployment error was hit, the 
> SessionJobObserver will not be able to tell whether it has submitted the job 
> or not. So it will simply try to submit it again. We have to find a mechanism 
> to correlate Jobs on the cluster with the SessionJob CR itself. Maybe we 
> could override the job name itself for this purpose or something like that.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
