[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643919#comment-17643919 ]
Alexis Sarda-Espinosa commented on FLINK-30305:
-----------------------------------------------

All right, understood. Then, forgetting that entirely, the scenario would be the following:
# {{upgradeMode=savepoint}}
# HA metadata unavailable
# Job in an unhealthy state

The operator currently gets stuck "waiting for upgradeable state", so it is no longer possible to alter the {{FlinkDeployment}} resource.
* What limitations prevent the operator from allowing resource modifications in this case?
* If conditions 1 and 2 hold, is the root cause of 3 relevant?

I imagine the critical question for the operator is: how can it be sure that the job is unhealthy and cannot recover on its own? Finding the answer could use sophisticated methods: we could check {{unavailableReplicas}} in the JM Deployment's status, maybe also check the {{restartCount}} of {{flink-main-container}} in the JM pod(s), etc. (see the sketch after the quoted description below). Alternatively, the method could be very simple: regardless of the automatic-rollback configuration, wait for {{kubernetes.operator.deployment.readiness.timeout}}; if the job does not reach a healthy state within that time, allow further spec changes while still respecting {{lastSavepoint}} and {{savepointHistory}}.

Do you see big problems with these ideas?

> Operator deletes HA metadata during stateful upgrade, preventing potential manual rollback
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing the resiliency of jobs with Kubernetes-based HA enabled, upgrade mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the operator. After the job was running, I purposely created an erroneous spec by changing my pod template to include an entry in {{envFrom -> secretRef}} with a name that doesn't exist. Schema validation passed, so the operator tried to upgrade the job, but the new pod hung with {{CreateContainerConfigError}}, and I saw this in the operator logs:
> {noformat}
> >>> Status | Info | UPGRADING | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even after I removed the non-existing entry from my pod template, the operator could no longer propagate the new spec because "Job is not running yet and HA metadata is not available, waiting for upgradeable state".
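For illustration only, here is a minimal sketch of the "sophisticated" detection idea above, assuming the fabric8 Kubernetes client that the operator already uses. The class name, method name, label selector, and restart threshold are hypothetical and not part of the operator's API:

{code:java}
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.List;

public class JobManagerHealthProbe {

    /**
     * Hypothetical helper (not operator code): returns true if the JobManager looks
     * unhealthy, based on unavailableReplicas of the JM Deployment and the
     * restartCount of flink-main-container in the JM pod(s).
     */
    static boolean looksUnhealthy(
            KubernetesClient client, String namespace, String jmDeploymentName, int maxRestarts) {

        Deployment jm = client.apps().deployments()
                .inNamespace(namespace)
                .withName(jmDeploymentName)
                .get();
        if (jm == null || jm.getStatus() == null) {
            // No JM deployment or no status yet: treat as unhealthy.
            return true;
        }

        Integer unavailable = jm.getStatus().getUnavailableReplicas();
        if (unavailable != null && unavailable > 0) {
            return true;
        }

        // Inspect restart counts of flink-main-container in the JM pod(s).
        // The "app" label selector is an assumption made for this sketch.
        List<Pod> pods = client.pods()
                .inNamespace(namespace)
                .withLabel("app", jmDeploymentName)
                .list()
                .getItems();
        for (Pod pod : pods) {
            if (pod.getStatus() == null || pod.getStatus().getContainerStatuses() == null) {
                continue;
            }
            for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
                Integer restarts = cs.getRestartCount();
                if ("flink-main-container".equals(cs.getName())
                        && restarts != null
                        && restarts > maxRestarts) {
                    return true;
                }
            }
        }
        return false;
    }
}
{code}

In practice such a probe would presumably be combined with the simpler idea as well, i.e. only treat the deployment as stuck once {{kubernetes.operator.deployment.readiness.timeout}} has elapsed without the probe reporting a healthy state.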