[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexis Sarda-Espinosa closed FLINK-30305.
-----------------------------------------
Resolution: Later
> Operator deletes HA metadata during stateful upgrade, preventing potential manual rollback
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-30305
> URL: https://issues.apache.org/jira/browse/FLINK-30305
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Alexis Sarda-Espinosa
> Priority: Major
>
> I was testing the resiliency of jobs with Kubernetes-based HA enabled, upgrade
> mode = {{savepoint}}, and _automatic_ rollback _disabled_ in the
> operator. Once the job was running, I deliberately created an erroneous spec by
> changing my pod template to include an {{envFrom -> secretRef}} entry that
> refers to a Secret that doesn't exist. Schema validation passed, so the operator
> tried to upgrade the job, but the new pod hung with {{CreateContainerConfigError}},
> and I saw this in the operator logs:
> {noformat}
> >>> Status | Info | UPGRADING | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even if I remove the non-existent entry from my pod template, the
> operator can no longer propagate the new spec and instead reports "Job is not
> running yet and HA metadata is not available, waiting for upgradeable state".
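The deadlock described above can be illustrated with a small model. This is a hedged sketch, not the operator's actual (Java) code: the `ObservedState` type, the `upgradeable` gate and the `replay_report` helper are all hypothetical, and the gate condition is only inferred from the wording of the log message "Job is not running yet and HA metadata is not available, waiting for upgradeable state". It replays the three steps from this report and shows that once the HA metadata has been deleted and the job cannot start, no spec change alone makes the deployment upgradeable again.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObservedState:
    """Hypothetical, simplified view of what the operator observes."""
    job_running: bool
    ha_metadata_available: bool


def upgradeable(state: ObservedState) -> bool:
    # Assumption inferred from the log message "Job is not running yet and
    # HA metadata is not available, waiting for upgradeable state": a new
    # spec is only applied if the job is running or HA metadata still exists.
    return state.job_running or state.ha_metadata_available


def replay_report() -> List[Tuple[str, bool]]:
    """Replay the steps from this report; return (step, upgradeable) pairs."""
    steps = [
        # Job healthy; Kubernetes HA metadata present (stored in ConfigMaps).
        ("running job", ObservedState(True, True)),
        # Faulty pod template applied: the operator deletes the JobManager
        # deployment and the HA metadata, and the new pod hangs in
        # CreateContainerConfigError, so the job never starts.
        ("after faulty upgrade", ObservedState(False, False)),
        # Spec fixed by hand: nothing the gate looks at has changed,
        # so the corrected spec is still blocked.
        ("after manual fix", ObservedState(False, False)),
    ]
    return [(name, upgradeable(state)) for name, state in steps]


if __name__ == "__main__":
    for name, ok in replay_report():
        print(f"{name}: upgradeable={ok}")
```

In this simplified model, keeping the HA metadata during the failed upgrade (that is, `ObservedState(False, True)`) would leave the deployment upgradeable, which is why deleting it is what prevents a manual rollback.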
--
This message was sent by Atlassian Jira
(v8.20.10#820010)