[ https://issues.apache.org/jira/browse/FLINK-39989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091440#comment-18091440 ]
Oleksandr Shulgin commented on FLINK-39989:
-------------------------------------------
Filippo points out that this issue was already reported as FLINK-32631
(https://issues.apache.org/jira/browse/FLINK-32631), but that ticket was closed
as "Cannot Reproduce".
This time the reproduction steps are clear.
> flinksessionjob stuck on "Job Not Found" if jobmanager terminates before
> operator learns about new job state transitions
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39989
> URL: https://issues.apache.org/jira/browse/FLINK-39989
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.13
> Environment: * Flink version: 2.1.2
> * Flink Kubernetes operator version: 1.13
> Reporter: Filippo Ghibellini
> Priority: Major
>
> h3. Steps to reproduce
> # create a `flinksessionjob` and wait for it to start running
> # scale down the k8s operator to 0 replicas to simulate a delayed
> reconciliation
> # use the Flink web UI to cancel the job (or use any other method to put the
> job in a [globally terminal
> state|https://nightlies.apache.org/flink/flink-docs-stable/docs/internals/job_scheduling/#jobmanager-data-structures])
> # delete all the job manager pods (will be replaced automatically by the k8s
> deployment)
> # scale the k8s operator back to 1 replica - this will resume the
> reconciliation process
> # The `flinksessionjob` k8s entity now reports `Job Not Found`
> It seems that the entire reconciliation process relies heavily on the k8s
> operator learning about job terminations from the job-manager {*}before the
> job-manager restarts{*}.
> A newly started job-manager will not recover jobs that reached a "globally
> terminal state" (since those are not even persisted in the HA state).
> In our case, the trigger for the jobs reaching a globally terminal state
> appears to have been us setting `spec.job.state=suspended` on the k8s
> `flinksessionjob` entity. So although the reproduction steps cancel the job
> through the UI, the problem can manifest even if the Flink cluster is managed
> exclusively through the k8s operator (it's just harder to reproduce).
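
To make the race described above concrete, here is a minimal toy model in Python. It is not Flink or operator code: `ToyJobManager`, `ToyOperator` and `run` are hypothetical names, and the only behaviour modelled is the one stated in the report (a job that reaches a globally terminal state is dropped from the HA state, so a restarted job-manager no longer knows about it).

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "RUNNING"
    CANCELED = "CANCELED"  # stands in for any globally terminal state


@dataclass
class ToyJobManager:
    """Hypothetical stand-in for a job-manager; not Flink code."""
    jobs: dict = field(default_factory=dict)      # in-memory view, lost on restart
    ha_store: dict = field(default_factory=dict)  # survives restarts

    def submit(self, job_id):
        self.jobs[job_id] = JobState.RUNNING
        self.ha_store[job_id] = JobState.RUNNING

    def cancel(self, job_id):
        self.jobs[job_id] = JobState.CANCELED
        # Assumption taken from the report: terminal jobs are not kept in HA state.
        self.ha_store.pop(job_id, None)

    def restart(self):
        # A new job-manager only recovers what the HA store still holds.
        self.jobs = dict(self.ha_store)


@dataclass
class ToyOperator:
    """Hypothetical stand-in for the operator's observe step."""
    last_observed: dict = field(default_factory=dict)

    def reconcile(self, jm, job_id):
        state = jm.jobs.get(job_id)
        if state is None:
            return "Job Not Found"
        self.last_observed[job_id] = state
        return state.value


def run(operator_observes_before_restart):
    """Return (status after restart, last state the operator ever saw)."""
    jm, op = ToyJobManager(), ToyOperator()
    jm.submit("job-1")
    op.reconcile(jm, "job-1")      # operator sees RUNNING
    jm.cancel("job-1")             # job reaches a terminal state
    if operator_observes_before_restart:
        op.reconcile(jm, "job-1")  # operator learns about the termination
    jm.restart()                   # job-manager pods are replaced
    return op.reconcile(jm, "job-1"), op.last_observed.get("job-1")
```

In this sketch both orderings end with the job missing from the restarted job-manager, but only `run(True)` leaves the toy operator with a recorded terminal state. With `run(False)` the last thing it saw was `RUNNING`, which mirrors the stuck `Job Not Found` situation in the reproduction steps.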