[jira] [Closed] (FLINK-32774) Reconciliation for autoscaling overrides gets stuck after cancel-with-savepoint

Gyula Fora (Jira) Wed, 09 Aug 2023 08:34:04 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gyula Fora closed FLINK-32774.
------------------------------
    Fix Version/s: kubernetes-operator-1.7.0
         Assignee: Gyula Fora  (was: Maximilian Michels)
       Resolution: Fixed

merged to main: 48d6703fa1f795e9849a3e690264cc5f6273349c
release-1.6: 575ea323f09a437cf9f483968588ea77c5a98835

> Reconciliation for autoscaling overrides gets stuck after 
> cancel-with-savepoint
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-32774
>                 URL: https://issues.apache.org/jira/browse/FLINK-32774
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.0
>            Reporter: Maximilian Michels
>            Assignee: Gyula Fora
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.6.0, kubernetes-operator-1.7.0
>
>
> Since https://issues.apache.org/jira/browse/FLINK-32589 the operator no 
> longer relies on the Flink configuration to store the parallelism overrides. 
> Instead, it stores them internally in the autoscaler config map. When 
> scaling without the rescaling API, the spec is changed on the fly during 
> reconciliation and the parallelism overrides are added.
> Unfortunately, this leads to the cluster getting stuck with the job in 
> FINISHED state after taking a savepoint for upgrade. The operator assumes 
> that the new cluster got deployed successfully and goes into DEPLOYED state 
> again.
> Log flow (from oldest to newest):
>  1. Rescheduling new reconciliation immediately to execute scaling operation.
>  2. Upgrading/Restarting running job, suspending first...
>  3. Job is in running state, ready for upgrade with SAVEPOINT
>  4. Suspending existing deployment.
>  5. Suspending job with savepoint.
>  6. Job successfully suspended with savepoint
>  7. The resource is being upgraded
>  8. Pending upgrade is already deployed, updating status.
>  9. Observing JobManager deployment. Previous status: DEPLOYING
>  10. JobManager deployment port is ready, waiting for the Flink REST API...
>  11. DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not 
> yet considered to be stable and might be rolled back in the future
> It appears the issue might be in (8): 
> [https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260]
>  because a parallelism override change alone does not change the 
> generation id.
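
The suspected failure mode in step 8 of the log flow can be illustrated with a
minimal sketch. It is not the operator's code and not the fix that was merged;
the class `UpgradeCheckSketch`, both method names, and the vertex id
`vertex-1` are hypothetical. It only shows why a check based on the generation
alone can report "already deployed" when nothing but the parallelism overrides
changed, and how also comparing the overrides would tell the two states apart.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified model; none of these names exist in the operator.
public class UpgradeCheckSketch {

    // Generation-only check: the deployment counts as already upgraded
    // when it carries the generation of the spec that is being rolled out.
    public static boolean alreadyDeployedByGeneration(
            long deployedGeneration, long targetGeneration) {
        return deployedGeneration == targetGeneration;
    }

    // Stricter check (an assumption for illustration): the parallelism
    // overrides applied on the fly must match as well.
    public static boolean alreadyDeployedStrict(
            long deployedGeneration, Map<String, String> deployedOverrides,
            long targetGeneration, Map<String, String> targetOverrides) {
        return deployedGeneration == targetGeneration
                && Objects.equals(deployedOverrides, targetOverrides);
    }

    public static void main(String[] args) {
        // The autoscaler changes only the overrides, so the generation stays the same.
        long generation = 5L;
        Map<String, String> before = Map.of("vertex-1", "2");
        Map<String, String> after = Map.of("vertex-1", "4");

        // The suspended deployment is mistaken for the upgraded one.
        System.out.println(alreadyDeployedByGeneration(generation, generation)); // true
        // Comparing the overrides as well detects that the upgrade is still pending.
        System.out.println(alreadyDeployedStrict(generation, before, generation, after)); // false
    }
}
```

With an unchanged generation, the first check returns true even though the
job was only suspended with a savepoint and never redeployed, which would match
the "Pending upgrade is already deployed, updating status." log line.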



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
