[
https://issues.apache.org/jira/browse/FLINK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495405#comment-17495405
]
Yang Wang edited comment on FLINK-26261 at 2/21/22, 9:45 AM:
-------------------------------------------------------------
Maybe we should verify that the JobManager pod is running before
building a Flink REST client to get the job status.
If the JobManager pod cannot be launched within a given timeout (e.g. 600s), I
think it is reasonable to suspend the job and forward the pod events to the
FlinkDeployment.
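
The check proposed above could be sketched roughly as follows. This is only an illustration of the decision logic, not operator code: the class, enum, and method names are hypothetical, the pod phase is passed in as a plain value instead of being read from the Kubernetes API, and the 600s timeout is just the example value from the comment.

```java
import java.time.Duration;

// Hypothetical sketch: decide what the reconciler should do based on the
// JobManager pod phase and the time elapsed since the deployment was submitted.
final class JobManagerReadinessCheck {

    // Simplified stand-in for the pod phase reported by Kubernetes.
    enum PodPhase { PENDING, RUNNING, FAILED }

    // Possible outcomes of the check (names are assumptions for illustration).
    enum Action { OBSERVE_JOB_STATUS, WAIT, SUSPEND_AND_FORWARD_EVENTS }

    static Action decide(PodPhase phase, Duration sinceSubmission, Duration launchTimeout) {
        if (phase == PodPhase.RUNNING) {
            // Only now is it worth building a REST client and querying the job status.
            return Action.OBSERVE_JOB_STATUS;
        }
        if (sinceSubmission.compareTo(launchTimeout) >= 0) {
            // The pod did not come up in time: suspend the job and surface the pod events.
            return Action.SUSPEND_AND_FORWARD_EVENTS;
        }
        // Still within the timeout: check again on the next reconciliation.
        return Action.WAIT;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(600); // example value from the comment
        System.out.println(decide(PodPhase.PENDING, Duration.ofSeconds(30), timeout));
        System.out.println(decide(PodPhase.PENDING, Duration.ofSeconds(900), timeout));
        System.out.println(decide(PodPhase.RUNNING, Duration.ofSeconds(900), timeout));
    }
}
```

In this sketch a pod in the FAILED phase is treated like a pending one until the timeout expires; whether it should be suspended immediately instead is a design choice the comment leaves open.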
was (Author: fly_in_gis):
Maybe we should verify whether the JobManager pod status is running before
building a Flink rest client to get job status.
If the JobManager pod could not be launched within a given timeout (e.g. 600s), then
we could suspend the job and forward the pod events to FlinkDeployment.
> Reconciliation should try to start job when not already started or move to
> permanent error
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-26261
> URL: https://issues.apache.org/jira/browse/FLINK-26261
> Project: Flink
> Issue Type: Sub-task
> Components: Kubernetes Operator
> Reporter: Thomas Weise
> Priority: Major
>
> When job submission fails, the operator currently keeps trying to find the
> job status. In the case I'm looking at, the cluster wasn't created because the
> image could not be resolved. We need logic to either re-attempt job
> submission or flag the submission as failed so that JobStatusObserver does
> not attempt to check again. We should also capture the submission error as
> an event on the CR.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)