[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643919#comment-17643919 ]

Alexis Sarda-Espinosa commented on FLINK-30305:
-----------------------------------------------

All right, understood. Then, setting that aside entirely, the scenario would be 
the following:

# {{upgradeMode=savepoint}}
# HA metadata unavailable
# Job in unhealthy state

The operator currently gets stuck "waiting for upgradeable state", so it's not 
possible to alter the {{FlinkDeployment}} resource anymore.

* What are the limitations that prevent the operator from allowing resource 
modifications in this case?
* If conditions 1 and 2 hold, is the root cause of 3 even relevant?

I would imagine the critical question for the operator is: how can it be sure 
that the job is unhealthy and cannot recover on its own? Answering that could 
involve relatively sophisticated checks: inspecting {{unavailableReplicas}} in 
the JM Deployment's status, perhaps also the {{restartCount}} of 
{{flink-main-container}} in the JM pod(s), and so on.
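To make that concrete, here is a minimal sketch of what such a check could look 
like, assuming the fabric8 {{KubernetesClient}} that the operator already uses 
and a {{component=jobmanager}} label selector for the JM pod(s) (both are 
assumptions on my side, not the operator's actual internals):
{code:java}
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.KubernetesClient;

final class JobManagerHealthCheck {

    /**
     * Sketch only: true if the JM Deployment reports unavailable replicas or any
     * flink-main-container has restarted, i.e. the job looks unhealthy.
     */
    static boolean looksUnhealthy(KubernetesClient client, String ns, String deploymentName) {
        Deployment jm = client.apps().deployments().inNamespace(ns).withName(deploymentName).get();
        if (jm != null && jm.getStatus() != null) {
            Integer unavailable = jm.getStatus().getUnavailableReplicas();
            if (unavailable != null && unavailable > 0) {
                return true;
            }
        }
        // "component=jobmanager" is an assumed label selector for the JM pod(s).
        for (Pod pod : client.pods().inNamespace(ns).withLabel("component", "jobmanager").list().getItems()) {
            boolean restarted = pod.getStatus().getContainerStatuses().stream()
                    .anyMatch(cs -> "flink-main-container".equals(cs.getName()) && cs.getRestartCount() > 0);
            if (restarted) {
                return true;
            }
        }
        return false;
    }
}
{code}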

Alternatively, the check could be very simple: regardless of the 
automatic-rollback configuration, wait for 
{{kubernetes.operator.deployment.readiness.timeout}}; if the job hasn't reached 
a healthy state by then, allow further spec changes while still respecting 
{{lastSavepoint}} and {{savepointHistory}}.
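Roughly, the simple variant would boil down to the following kind of logic 
(pure sketch with my own naming, only meant to illustrate the idea; the real 
timeout would come from {{kubernetes.operator.deployment.readiness.timeout}}):
{code:java}
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of the simple variant: once the post-upgrade deployment has been unhealthy
 * for longer than the configured readiness timeout, stop blocking on "waiting for
 * upgradeable state" and accept new spec changes, while still using lastSavepoint /
 * savepointHistory for the next upgrade.
 */
final class ReadinessGate {

    private final Duration readinessTimeout; // kubernetes.operator.deployment.readiness.timeout
    private Instant unhealthySince;          // first time the deployment was observed unhealthy

    ReadinessGate(Duration readinessTimeout) {
        this.readinessTimeout = readinessTimeout;
    }

    /** Called on every reconciliation with the current health observation. */
    boolean allowSpecChanges(boolean deploymentHealthy, Instant now) {
        if (deploymentHealthy) {
            unhealthySince = null; // healthy again, reset the clock
            return true;           // the normal upgrade path applies
        }
        if (unhealthySince == null) {
            unhealthySince = now;
        }
        // Unhealthy past the readiness timeout: stop waiting and let the user
        // submit a corrected spec instead of staying stuck indefinitely.
        return Duration.between(unhealthySince, now).compareTo(readinessTimeout) >= 0;
    }
}
{code}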

Do you see big problems with these ideas?

> Operator deletes HA metadata during stateful upgrade, preventing potential 
> manual rollback
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade 
> mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the 
> operator. After the job was running, I purposely created an erroneous spec by 
> changing my pod template to include an entry in {{envFrom -> secretRef}} with 
> a name that doesn't exist. Schema validation passed, so the operator tried to 
> upgrade the job, but the new pod hangs with {{CreateContainerConfigError}}, 
> and I see this in the operator logs:
> {noformat}
> >>> Status | Info    | UPGRADING       | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even if I remove the non-existing entry from my pod template, the 
> operator can no longer propagate the new spec because "Job is not running yet 
> and HA metadata is not available, waiting for upgradeable state".
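For illustration of the reproduction described in the quoted issue above: with 
the fabric8 model, the erroneous pod-template entry would correspond to 
something like the snippet below (the secret name is of course made up).
{code:java}
import io.fabric8.kubernetes.api.model.EnvFromSource;
import io.fabric8.kubernetes.api.model.EnvFromSourceBuilder;

final class ErroneousPodTemplateExample {

    // Hypothetical equivalent of the pod-template change described above: an
    // envFrom -> secretRef pointing at a Secret that does not exist. This passes
    // schema validation but leaves the new JobManager pod in CreateContainerConfigError.
    static EnvFromSource badSecretRef() {
        return new EnvFromSourceBuilder()
                .withNewSecretRef()
                    .withName("secret-that-does-not-exist") // made-up name
                .endSecretRef()
                .build();
    }
}
{code}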


