[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643922#comment-17643922 ]

Gyula Fora commented on FLINK-30305:
------------------------------------

I think you are missing the crucial problem here. You have to guarantee that a 
job always restores from the very latest checkpoint, otherwise sink-specific or 
other mechanisms may fail for the user and cause duplication or data loss.

 

It would be easy to implement some naive logic that restores from the last 
observed savepoint, but that's dangerous and it's going to break prod jobs. 
Maybe not for you specifically, but we have to consider all use cases.

 

Once we have submitted a job from the operator, as long as it is running (or we 
cannot determine its exact state), the HA metadata is the only reliable source 
of information about the latest checkpoint. Going around it introduces all 
kinds of tricky corner cases.
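
 

To make this concrete: with Kubernetes-based HA the "HA metadata" lives in 
ConfigMaps written by the JobManager, and the operator relies on those to find 
the latest checkpoint. The sketch below shows roughly what such a ConfigMap 
looks like; the name pattern, labels and data keys are assumptions and differ 
between Flink versions, so treat it as an illustration only:
{noformat}
# Illustrative HA ConfigMap (name, labels and keys are assumptions,
# they differ between Flink versions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-job-a7b1c2d3e4f5a7b1c2d3e4f5a7b1c2d3-config-map  # <cluster-id>-<job-id>-...
  labels:
    app: my-job
    configmap-type: high-availability
    type: flink-native-kubernetes
data:
  counter: "43"                              # checkpoint ID counter
  checkpointID-0000000042: "<state handle>"  # pointer to the latest completed checkpoint
  jobGraph-a7b1c2d3e4f5a7b1c2d3e4f5a7b1c2d3: "<state handle>"
{noformat}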

 

As I said earlier, the only real solution we need is to somehow figure out that 
the JM never started. That would solve all our problems. Relying on the HA 
metadata once the job is submitted is required for correctness.
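
 

For what it's worth, in the scenario from the description the "JM never 
started" condition is at least observable on the JobManager pod itself. A 
sketch of the relevant status fragment (placeholder names, not actual operator 
output) while the container is stuck:
{noformat}
# Fragment of `kubectl get pod <jm-pod> -o yaml` (placeholder names)
status:
  phase: Pending
  containerStatuses:
    - name: flink-main-container
      ready: false
      started: false
      restartCount: 0
      state:
        waiting:
          reason: CreateContainerConfigError
          message: secret "does-not-exist" not found
{noformat}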

> Operator deletes HA metadata during stateful upgrade, preventing potential 
> manual rollback
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade 
> mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the 
> operator. After the job was running, I purposely created an erroneous spec by 
> changing my pod template to include an entry in {{envFrom -> secretRef}} with 
> a name that doesn't exist. Schema validation passed, so the operator tried to 
> upgrade the job, but the new pod hung with {{CreateContainerConfigError}}, 
> and I saw this in the operator logs:
> {noformat}
> >>> Status | Info    | UPGRADING       | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even if I remove the non-existent entry from my pod template, the 
> operator can no longer propagate the new spec because "Job is not running yet 
> and HA metadata is not available, waiting for upgradeable state".
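
For reference, the spec change described above boils down to something like the 
fragment below (a sketch with placeholder names, images and paths, not the 
reporter's actual spec); the Secret referenced under {{secretRef}} is the part 
that does not exist:
{noformat}
# Illustrative FlinkDeployment (all names, images and paths are placeholders)
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job
spec:
  image: flink:1.15
  flinkVersion: v1_15
  serviceAccount: flink
  flinkConfiguration:
    # Kubernetes-based HA as in the report; exact keys depend on the Flink version
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/ha
    state.checkpoints.dir: s3://my-bucket/checkpoints
    state.savepoints.dir: s3://my-bucket/savepoints
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: savepoint
  podTemplate:
    apiVersion: v1
    kind: Pod
    spec:
      containers:
        - name: flink-main-container
          envFrom:
            - secretRef:
                name: does-not-exist  # Secret is missing: schema validation passes,
                                      # but the new JM pod ends in CreateContainerConfigError
{noformat}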



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
