[
https://issues.apache.org/jira/browse/FLINK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-32774.
------------------------------
Fix Version/s: kubernetes-operator-1.7.0
Assignee: Gyula Fora (was: Maximilian Michels)
Resolution: Fixed
merged to main: 48d6703fa1f795e9849a3e690264cc5f6273349c
release-1.6: 575ea323f09a437cf9f483968588ea77c5a98835
> Reconciliation for autoscaling overrides gets stuck after
> cancel-with-savepoint
> -------------------------------------------------------------------------------
>
> Key: FLINK-32774
> URL: https://issues.apache.org/jira/browse/FLINK-32774
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.0
> Reporter: Maximilian Michels
> Assignee: Gyula Fora
> Priority: Critical
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.6.0, kubernetes-operator-1.7.0
>
>
> Since https://issues.apache.org/jira/browse/FLINK-32589 the operator no longer
> relies on the Flink configuration to store the parallelism overrides.
> Instead, it stores them internally in the autoscaler config map. When
> scaling without the rescaling API, the spec is changed on the fly during
> reconciliation and the parallelism overrides are added.
> Unfortunately, this leads to the cluster getting stuck with the job in
> FINISHED state after taking a savepoint for upgrade. The operator assumes
> that the new cluster got deployed successfully and goes into DEPLOYED state
> again.
> Log flow (from oldest to newest):
> 1. Rescheduling new reconciliation immediately to execute scaling operation.
> 2. Upgrading/Restarting running job, suspending first...
> 3. Job is in running state, ready for upgrade with SAVEPOINT
> 4. Suspending existing deployment.
> 5. Suspending job with savepoint.
> 6. Job successfully suspended with savepoint
> 7. The resource is being upgraded
> 8. Pending upgrade is already deployed, updating status.
> 9. Observing JobManager deployment. Previous status: DEPLOYING
> 10. JobManager deployment port is ready, waiting for the Flink REST API...
> 11. DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not
>     yet considered to be stable and might be rolled back in the future
> It appears the issue might be in (8):
> [https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260]
> because a change to the parallelism overrides alone does not change the
> generation id, so the pending upgrade is treated as already deployed.
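
The suspected failure mode can be illustrated with a minimal Java sketch. This is not the operator's code and not the merged fix: the class and method names are hypothetical, and the "strict" variant only shows one way such a check could distinguish the two cases, assuming the effective parallelism overrides are available on both sides of the comparison.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration of the check described in the report; not operator code. */
public class GenerationCheckSketch {

    /**
     * Generation-only check, as the report suspects: the upgrade counts as
     * "already deployed" whenever the deployed generation equals the target one.
     */
    public static boolean alreadyDeployedByGeneration(long deployedGeneration, long targetGeneration) {
        return deployedGeneration == targetGeneration;
    }

    /**
     * Stricter check (an assumption for illustration): also require that the
     * parallelism overrides applied to the deployment match the target overrides.
     */
    public static boolean alreadyDeployedStrict(
            long deployedGeneration,
            long targetGeneration,
            Map<String, String> deployedOverrides,
            Map<String, String> targetOverrides) {
        return deployedGeneration == targetGeneration
                && Objects.equals(deployedOverrides, targetOverrides);
    }

    public static void main(String[] args) {
        // The autoscaler changes only the overrides; the user spec, and with it
        // the generation, stays the same.
        long generation = 3L;
        Map<String, String> deployed = Map.of("vertex-a", "2");
        Map<String, String> target = Map.of("vertex-a", "4");

        // The generation-only check reports the upgrade as done although nothing was redeployed.
        System.out.println(alreadyDeployedByGeneration(generation, generation));
        // The stricter check notices that the overrides still differ.
        System.out.println(alreadyDeployedStrict(generation, generation, deployed, target));
    }
}
```

With the generation-only check, the observer would move on to "updating status" right after the job was suspended, which matches step (8) in the log flow above.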