[ https://issues.apache.org/jira/browse/FLINK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora updated FLINK-26930:
-------------------------------
Priority: Minor (was: Major)
> Rethink last-state upgrade implementation in flink-kubernetes-operator
> ----------------------------------------------------------------------
>
> Key: FLINK-26930
> URL: https://issues.apache.org/jira/browse/FLINK-26930
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Yang Wang
> Priority: Minor
>
> Following the discussion in FLINK-26916.
>
> How does the last-state upgrade work now?
> First, the Flink cluster is deleted directly, with the HA ConfigMap retained.
> This leaves the job in a "SUSPENDED" state. Then flink-kubernetes-operator
> deploys a new Flink application with the same cluster-id so that it can
> recover from the latest checkpoint. Note that before the application starts,
> the JobGraph is deleted from the HA ConfigMap. This ensures that the newly
> changed job options take effect.
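
The sequence described above can be sketched on an in-memory model. This is
only an illustration, not the operator's code: the ConfigMap is modeled as a
plain dict, the key prefixes "jobGraph-" and "checkpointID-" and the function
name last_state_upgrade are assumptions, and the checkpoint values are opaque
placeholder handles rather than real paths.

```python
JOB_GRAPH_PREFIX = "jobGraph-"        # assumed key prefix, for illustration only
CHECKPOINT_PREFIX = "checkpointID-"   # assumed key prefix, for illustration only


def last_state_upgrade(cluster_id, ha_data, new_options):
    """Return (deployment, retained_ha_data) for a modeled last-state upgrade."""
    # 1. The cluster is deleted but the HA ConfigMap is retained:
    #    model this by copying the data instead of discarding it.
    retained = dict(ha_data)

    # 2. Remove the stored JobGraph so the changed job options take effect.
    retained = {k: v for k, v in retained.items()
                if not k.startswith(JOB_GRAPH_PREFIX)}

    # 3. Redeploy under the same cluster-id; the new application recovers
    #    from the latest checkpoint entry left in the retained data.
    #    Sorting relies on the zero-padded checkpoint IDs in the keys.
    checkpoint_keys = sorted(k for k in retained
                             if k.startswith(CHECKPOINT_PREFIX))
    latest = retained[checkpoint_keys[-1]] if checkpoint_keys else None
    deployment = {
        "cluster_id": cluster_id,
        "options": dict(new_options),
        "restore_handle": latest,
    }
    return deployment, retained
```

In this model the operator never sees a checkpoint path, only whatever handle
the ConfigMap holds, which is the limitation the two solutions below address.
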
>
> Solution 1: Extend the JobResultStore (JRS) so that the stored job result
> contains the list of retained checkpoints. This of course implies that the
> cluster is shut down / the job is terminated properly (other cases should be
> used for fail-over scenarios only).
>
> Solution 2: Store the last checkpoint path in the Kubernetes HA ConfigMap.
> This could be a minimal, backward-compatible change that we could backport to
> release-1.15/release-1.14.
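
A minimal sketch of what Solution 2 might look like, again on a dict standing
in for the ConfigMap data. The key name "lastCheckpointPath", both function
names, and the example path in the usage note are hypothetical; the issue does
not specify any of them.

```python
LAST_CHECKPOINT_KEY = "lastCheckpointPath"  # hypothetical key name


def record_checkpoint(ha_data, checkpoint_path):
    """Return a copy of the ConfigMap data with the latest path recorded."""
    updated = dict(ha_data)
    # Only one extra entry is written; existing entries are left untouched,
    # so readers that do not know the key are unaffected.
    updated[LAST_CHECKPOINT_KEY] = checkpoint_path
    return updated


def last_checkpoint_path(ha_data):
    """Return the recorded path, or None if the ConfigMap predates the change."""
    return ha_data.get(LAST_CHECKPOINT_KEY)
```

For example, record_checkpoint(data, "s3://bucket/checkpoints/chk-42") followed
by last_checkpoint_path(...) would return that path, while a ConfigMap written
by an older release yields None and the caller can fall back to the current
behaviour.
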
>
> As soon as there is a straightforward way of accessing the last checkpoint,
> we should improve the current implementation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)