[jira] [Commented] (FLINK-26916) The Operator ignores job related changes (jar, parallelism) during last-state upgrades

Gyula Fora (Jira) Tue, 29 Mar 2022 10:40:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514248#comment-17514248
 ]


Gyula Fora commented on FLINK-26916:
------------------------------------

After some discussion with [~thw] it seems there is an inherent limitation in 
the Kubernetes HA service that makes this nearly impossible to implement.

The Kubernetes HA service only stores a pointer to a file within the HA 
storage dir. To get the actual checkpoint pointer, one would need to read the 
CompletedCheckpoint object from the HA storage dir and extract the pointer from 
it.

This would require access to the HA storage from within the operator, which is 
completely infeasible.

We suggest changing the Flink Kubernetes HA Service implementation to also 
store the external checkpoint pointer in the same ConfigMap. This could be a 
minimal, backward-compatible change that we should aim to get in for 1.15 and, 
if simple enough, backport to the next 1.14 release.
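
To make the proposal above concrete, here is a rough sketch of what the 
operator side could look like if the external pointer were stored next to the 
existing entry. Everything in it is an assumption for illustration: the key 
prefix, the zero-padded key layout, the example paths and the class itself are 
hypothetical and are not what the Flink HA service uses today. The ConfigMap 
data is modelled as a plain Map so the sketch has no Flink or Kubernetes 
dependencies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch only: models ConfigMap data as a Map<String, String>.
class CheckpointPointerSketch {

    // Hypothetical prefix for the proposed additional entries.
    static final String EXTERNAL_POINTER_PREFIX = "externalPointer-";

    // Zero-padded so that lexicographic key order matches checkpoint id order.
    static String externalPointerKey(long checkpointId) {
        return String.format("%s%019d", EXTERNAL_POINTER_PREFIX, checkpointId);
    }

    // Returns the external pointer of the newest checkpoint, if any entry exists.
    // No access to the HA storage dir is needed for this lookup.
    static Optional<String> latestExternalPointer(Map<String, String> configMapData) {
        return configMapData.entrySet().stream()
                .filter(e -> e.getKey().startsWith(EXTERNAL_POINTER_PREFIX))
                .max(Map.Entry.comparingByKey())
                .map(Map.Entry::getValue);
    }

    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        // Stand-in for the existing entry: only a pointer to a file in the HA storage dir.
        data.put("checkpointID-0000000000000000042", "file:///ha/completedCheckpoint-abc");
        // Proposed additional entries holding the external checkpoint pointers.
        data.put(externalPointerKey(41), "file:///checkpoints/chk-41");
        data.put(externalPointerKey(42), "file:///checkpoints/chk-42");

        System.out.println(latestExternalPointer(data).orElse("<none>"));
    }
}
```

With something along these lines the operator would only need to read the 
ConfigMap, which it can already access, and would not have to deserialize the 
CompletedCheckpoint object.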

Due to these inherent limitations we propose to add a big fat warning to the 
last-state upgrade mode and point out that job changes are not possible and 
accept this as a limitation for the preview release.

[~wangyang0918]: Yang, you are familiar with the Kubernetes HA 
implementation. Do you think we can reasonably make this change for 1.15? What 
is your gut feeling?

> The Operator ignores job related changes (jar, parallelism) during last-state 
> upgrades
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-26916
>                 URL: https://issues.apache.org/jira/browse/FLINK-26916
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-0.1.0, kubernetes-operator-1.0.0
>            Reporter: Matyas Orhidi
>            Assignee: Gyula Fora
>            Priority: Critical
>
> RC: The old jobgraph is being reused when resuming



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
