[ 
https://issues.apache.org/jira/browse/FLINK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499966#comment-17499966
 ] 

Gyula Fora commented on FLINK-26345:
------------------------------------

 [~nicholasjiang]  I don't think Thomas is working on this; however, as an 
accidental "side effect", it seems to be partly covered by this 
restructuring effort: 
[https://github.com/apache/flink-kubernetes-operator/pull/26]

With this change we now always observe the JobManager deployment (and job) 
status before anything else, so in practice that would detect anything that 
was already running.

However, the above changes alone do not fix the problems. I think we would 
still get exceptions, but I believe it will be easier to fix :) 

I would appreciate it if you could review the PR and base this work on top of it.
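
To illustrate the observe-first ordering described above, here is a minimal, self-contained sketch. It is not the operator's actual code: `ObserveFirstSketch`, `FakeCluster`, `Resource`, `observe` and `reconcile` are all hypothetical names, and the cluster is an in-memory stand-in. It only shows the idea that checking for an existing deployment before submitting avoids a second submission when the recorded status is empty.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are the operator's real API. */
class ObserveFirstSketch {

    /** In-memory stand-in for the cluster: tracks which deployments exist. */
    static class FakeCluster {
        private final Set<String> deployments = new HashSet<>();

        boolean deploymentExists(String name) {
            return deployments.contains(name);
        }

        /** Mimics the failure mode from the issue: deploying twice is an error. */
        void deploy(String name) {
            if (!deployments.add(name)) {
                throw new IllegalStateException(
                        "The Flink cluster " + name + " already exists.");
            }
        }
    }

    /** Stand-in for the custom resource; status == null means nothing was recorded. */
    static class Resource {
        final String name;
        String status;

        Resource(String name) {
            this.name = name;
        }
    }

    /** Observe step: look at the cluster itself, even if the status is empty. */
    static void observe(FakeCluster cluster, Resource resource) {
        if (cluster.deploymentExists(resource.name)) {
            resource.status = "DEPLOYED";
        }
    }

    /** Reconcile step: always observe first, and submit only if nothing is running. */
    static boolean reconcile(FakeCluster cluster, Resource resource) {
        observe(cluster, resource);
        if (resource.status != null) {
            return false; // already running, do not submit again
        }
        cluster.deploy(resource.name);
        resource.status = "DEPLOYED";
        return true;
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        Resource resource = new Resource("job-name");

        // Simulate the corner case: the job was submitted, but the status update was lost.
        cluster.deploy("job-name");

        boolean submittedAgain = reconcile(cluster, resource);
        System.out.println("submitted again: " + submittedAgain
                + ", status: " + resource.status);
    }
}
```

In this sketch, skipping the `observe` call while the status is null would make `reconcile` call `deploy` a second time and hit the "already exists" error, which is the situation the issue describes.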

> Observer should detect flink job even if deployment status is empty
> -------------------------------------------------------------------
>
>                 Key: FLINK-26345
>                 URL: https://issues.apache.org/jira/browse/FLINK-26345
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>
> Currently it is possible to get into a corner case where the job is submitted 
> by the reconciler but the deployment status is not updated to reflect the 
> submission.
> In these cases the observer does not attempt to "recover" the cluster; it 
> simply skips the observation step, thinking that the job is not running 
> (status == null).
> However, this means that the reconciler will try to submit it again, leading 
> to the error:
> {code:java}
> org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster job-name already exists.
>      at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:179)
>      at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
>      at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:73)
>      at org.apache.flink.kubernetes.operator.reconciler.JobReconciler.deployFlinkJob(JobReconciler.java:123)
>      at org.apache.flink.kubernetes.operator.reconciler.JobReconciler.reconcile(JobReconciler.java:65)
>      at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcileFlinkDeployment(FlinkDeploymentController.java:126)
>      at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:102)
>      at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:51)
>  {code}
> This is somewhat related to FLINK-26261, cc [~thw] 
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
