[
https://issues.apache.org/jira/browse/FLINK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luca Castelli updated FLINK-37320:
----------------------------------
Summary: [Observer] FINISHED finite streaming jobs incorrectly being set to
RECONCILING (was: FINISHED finite streaming jobs incorrectly being set to
RECONCILING)
> [Observer] FINISHED finite streaming jobs incorrectly being set to RECONCILING
> ------------------------------------------------------------------------------
>
> Key: FLINK-37320
> URL: https://issues.apache.org/jira/browse/FLINK-37320
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.10.0
> Environment: I've attached the FlinkDeployment CR and operator-config
> I used to reproduce this locally.
> Reporter: Luca Castelli
> Priority: Minor
> Labels: pull-request-available
> Attachments: operator-config.yaml,
> operator-log-finite-streaming-job.log, test-finite-streaming-job.yaml
>
>
> Hello,
> I believe I've found a bug within the observation logic for finite streaming
> jobs. This is a follow-up to:
> [https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
> *For finite streaming jobs:*
> # The job finishes successfully and the job status changes to FINISHED
> # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the
> JM deployments and clears HA configmap data
> # On the next reconciliation loop, the observer sees the JM deployment as
> MISSING and changes the job status from FINISHED to RECONCILING
> The job had reached a terminal state. It shouldn't have been set back to
> RECONCILING.
> This leads to an operator error later, when a recovery attempt is triggered.
> The recovery is triggered because the JM is MISSING, the status is
> RECONCILING, the spec shows RUNNING, and HA is enabled. The recovery fails
> with validateHaMetadataExists throwing UpgradeFailureException.
> At that point the deployment gets stuck in a loop with status RECONCILING and
> UpgradeFailureException thrown on each cycle. I've attached operator logs
> showing this.
> *Proposed solution:* I think the fix would be to wrap
> [AbstractFlinkDeploymentObserver.observeJmDeployment|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
> in an if-statement that checks that the job is not in a terminal state.
> Happy to discuss and/or open a PR with the two-line change.
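
The following is a minimal, self-contained sketch of the behaviour described above and of the proposed guard. It is not the operator's actual code: the class TerminalStateGuardSketch, its enums, and the methods observeUnguarded, observeGuarded, and recoveryTriggered are hypothetical stand-ins, and treating FINISHED, FAILED, and CANCELED as the terminal states is an assumption made for illustration only.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the observer logic; none of these names come from the operator code base.
public class TerminalStateGuardSketch {

    // Simplified stand-ins for the job status and the JobManager deployment status.
    public enum JobStatus { RUNNING, RECONCILING, FINISHED, FAILED, CANCELED }
    public enum JmStatus { READY, MISSING }

    // Assumption: these are the states the guard should treat as terminal.
    static final Set<JobStatus> TERMINAL =
            EnumSet.of(JobStatus.FINISHED, JobStatus.FAILED, JobStatus.CANCELED);

    // Behaviour described in the report: a MISSING JM always resets the status to RECONCILING.
    public static JobStatus observeUnguarded(JobStatus current, JmStatus jm) {
        return jm == JmStatus.MISSING ? JobStatus.RECONCILING : current;
    }

    // Proposed behaviour: skip the JM observation when the job is already terminal.
    public static JobStatus observeGuarded(JobStatus current, JmStatus jm) {
        if (TERMINAL.contains(current)) {
            return current;
        }
        return observeUnguarded(current, jm);
    }

    // The four conditions the report lists as triggering a recovery attempt.
    public static boolean recoveryTriggered(
            JmStatus jm, JobStatus status, JobStatus specState, boolean haEnabled) {
        return jm == JmStatus.MISSING
                && status == JobStatus.RECONCILING
                && specState == JobStatus.RUNNING
                && haEnabled;
    }

    public static void main(String[] args) {
        // After TTL cleanup the JM is MISSING while the job is FINISHED.
        JobStatus unguarded = observeUnguarded(JobStatus.FINISHED, JmStatus.MISSING);
        JobStatus guarded = observeGuarded(JobStatus.FINISHED, JmStatus.MISSING);

        System.out.println("unguarded: " + unguarded + ", recovery triggered: "
                + recoveryTriggered(JmStatus.MISSING, unguarded, JobStatus.RUNNING, true));
        System.out.println("guarded:   " + guarded + ", recovery triggered: "
                + recoveryTriggered(JmStatus.MISSING, guarded, JobStatus.RUNNING, true));
    }
}
```

In this model the unguarded path moves a FINISHED job to RECONCILING and satisfies all four recovery conditions, while the guarded path leaves the status at FINISHED so no recovery is attempted. A non-terminal job with a MISSING JM is still set to RECONCILING in both paths.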
--
This message was sent by Atlassian Jira
(v8.20.10#820010)