[ 
https://issues.apache.org/jira/browse/FLINK-38049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045520#comment-18045520
 ] 

Lucas Borges commented on FLINK-38049:
--------------------------------------

I think we've hit this issue again.
When a Flink deployment is upgraded with the stateless upgrade mode while HA is 
still enabled, the operator tries to validate that the HA metadata is present, 
and the upgrade fails with:

```
>>> Event[Job] | Warning | RESTOREFAILED | HA metadata not available to restore 
>>> from last state. It is possible that the job has finished or terminally 
>>> failed, or the configmaps have been deleted.
```

This happens on every other job upgrade when both stateless and HA are enabled.
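
To make the expected behaviour concrete, here is a minimal sketch of the guard I have in mind. It is not the operator's actual code: `RestoreGuardSketch`, the local `UpgradeMode` enum and both method names are stand-ins made up for illustration, and the "current" variant only reflects what we observe from the outside, not what the reconciler literally does.

```java
/** Illustrative only: every type and method here is a hypothetical stand-in. */
final class RestoreGuardSketch {

    /** Local mirror of the three upgrade modes discussed in this issue. */
    enum UpgradeMode { STATELESS, LAST_STATE, SAVEPOINT }

    /** Observed behaviour (assumption): HA alone triggers the HA metadata check. */
    static boolean requiresHaMetadataObserved(UpgradeMode mode, boolean haEnabled) {
        return haEnabled;
    }

    /** Proposed behaviour: a stateless upgrade never depends on HA metadata. */
    static boolean requiresHaMetadata(UpgradeMode mode, boolean haEnabled) {
        return haEnabled && mode != UpgradeMode.STATELESS;
    }

    public static void main(String[] args) {
        // Print both decisions for every mode with HA enabled.
        for (UpgradeMode mode : UpgradeMode.values()) {
            System.out.println(mode
                    + " observed=" + requiresHaMetadataObserved(mode, true)
                    + " proposed=" + requiresHaMetadata(mode, true));
        }
    }
}
```

With a check like this, a stateless deployment with HA enabled would skip the HA metadata validation and start from empty state, while savepoint and last-state deployments would keep today's behaviour.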

> UpgradeMode falls back to last-state mode on HA resubmission even for 
> stateless
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-38049
>                 URL: https://issues.apache.org/jira/browse/FLINK-38049
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.11.0, kubernetes-operator-1.12.0
>            Reporter: Lucas Borges
>            Priority: Major
>
> From this [documentation 
> page|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades]:
> > _* When HA is enabled the {{savepoint}} upgrade mode may fall back to the 
> > {{last-state}} behaviour in cases where the job is in an unhealthy state._
> This leads me to believe that the upgrade mode may fall back to last-state 
> if HA is enabled *and* the previous mode was savepoint. In practice, though, 
> if HA is enabled, the job resubmission falls back to last-state 
> regardless of the previous upgrade mode. This led to a situation where we 
> launched a job with stateless mode and HA, which, after becoming unhealthy, got 
> resubmitted by the operator in last-state mode (unexpected?).
> I think this is a bug, and that [this 
> condition|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java#L577-L579]
>  should also check that the upgrade mode is not stateless.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
