[jira] [Commented] (FLINK-26930) Rethink last-state upgrade implementation in flink-kubernetes-operator

Yang Wang (Jira) Fri, 24 Jun 2022 04:05:10 -0700


    [ https://issues.apache.org/jira/browse/FLINK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558435#comment-17558435 ]


Yang Wang commented on FLINK-26930:
-----------------------------------

Storing the checkpoint in the JRS is also necessary, because the entry read 
from the ConfigMap might not be the newest one. So the whole story works as 
follows.

If the JRS entry exists (i.e. the job terminated globally), we rely on the 
latest checkpoint there. If the JRS entry doesn’t exist, which usually means 
the JobManager kept crashing and restarting with backoff and the job could not 
terminate globally, the HA-related ConfigMap should still exist. In that case, 
we rely on the ConfigMap entry with the plain-text checkpoint path.
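
The lookup order above could be sketched roughly as below. This is only an 
illustration under assumptions: the class name, the method name and the paths 
are hypothetical and are not existing Flink or flink-kubernetes-operator APIs. 
An empty JRS value stands for "no JRS entry exists".

```java
import java.util.Optional;

// Hypothetical helper, for illustration only. It is not part of Flink or of
// the flink-kubernetes-operator.
final class LastCheckpointResolver {

    private LastCheckpointResolver() {}

    /**
     * Picks the checkpoint path to restore from.
     *
     * @param jrsCheckpoint latest checkpoint recorded in the JRS entry;
     *                      empty if no JRS entry exists (assumed model)
     * @param configMapCheckpoint plain-text checkpoint path read from the HA
     *                            ConfigMap; empty if the entry is absent
     */
    static Optional<String> resolve(
            Optional<String> jrsCheckpoint, Optional<String> configMapCheckpoint) {
        // The job terminated globally: the JRS entry is authoritative.
        if (jrsCheckpoint.isPresent()) {
            return jrsCheckpoint;
        }
        // No JRS entry: fall back to the HA ConfigMap, which should still exist.
        return configMapCheckpoint;
    }

    public static void main(String[] args) {
        // Made-up paths, only to exercise both branches.
        System.out.println(resolve(Optional.of("s3://bucket/chk-42"),
                Optional.of("s3://bucket/chk-41")));
        System.out.println(resolve(Optional.empty(),
                Optional.of("s3://bucket/chk-41")));
    }
}
```

The JRS value wins whenever it is present, since the ConfigMap entry might be 
older. If both are empty, the caller has no checkpoint to restore from.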

 

However, I still hesitate to store the last checkpoint path in plain-text 
format in the HA store.

> Rethink last-state upgrade implementation in flink-kubernetes-operator
> ----------------------------------------------------------------------
>
>                 Key: FLINK-26930
>                 URL: https://issues.apache.org/jira/browse/FLINK-26930
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Yang Wang
>            Priority: Major
>
> Following the discussion in FLINK-26916.
>  
> How does the last-state upgrade work now?
> First, delete the Flink cluster directly with the HA ConfigMap retained. This 
> leaves the job in a "SUSPENDED" state. Then flink-kubernetes-operator will 
> deploy a new Flink application with the same cluster-id so that it can 
> recover from the latest checkpoint. Please note that before starting the 
> application, the JobGraph will be deleted from the HA ConfigMap. This is to 
> ensure the newly changed job options take effect.
>  
> Solution 1: Extend the JRS so the stored job result contains the list of 
> retained checkpoints. This of course implies that the cluster gets shut down 
> / the job gets terminated properly (other cases should be used for fail-over 
> scenarios only).
>  
> Solution 2: Store the last checkpoint path in the Kubernetes HA ConfigMap. 
> This could be a minimal backward compatible change that we could backport to 
> release-1.15/release-1.14.
>  
> As soon as there is a straightforward way of accessing the last checkpoint, 
> we should improve the current implementation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
