[jira] [Commented] (FLINK-26930) Rethink last-state upgrade implementation in flink-kubernetes-operator

Yang Wang (Jira) Wed, 30 Mar 2022 19:35:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515028#comment-17515028
 ]


Yang Wang commented on FLINK-26930:
-----------------------------------

I am hesitant to store the last checkpoint path in the HA store, which means 
not only the K8s ConfigMap but also ZooKeeper. Even though it is a minimal, 
backward-compatible change, it feels like a small temporary hack, since its 
only purpose is to expose the checkpoint information so that external tools 
can pick it up. Let's discuss this further here and create a new ticket if 
needed.

 

[~dmvk] suggests storing the retained checkpoints in the JobResultStore 
(JRS) [1], which might not work for the JobManager crash backoff scenario.

 

[1]. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435

> Rethink last-state upgrade implementation in flink-kubernetes-operator
> ----------------------------------------------------------------------
>
>                 Key: FLINK-26930
>                 URL: https://issues.apache.org/jira/browse/FLINK-26930
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Yang Wang
>            Priority: Major
>
> Following the discussion in FLINK-26916.
>  
> How does the last-state upgrade work now?
> First, the Flink cluster is deleted directly while the HA ConfigMap is 
> retained. This leaves the job in a "SUSPENDED" state. Then 
> flink-kubernetes-operator deploys a new Flink application with the same 
> cluster-id so that it can recover from the latest checkpoint. Please note 
> that before the application is started, the JobGraph is deleted from the HA 
> ConfigMap. This ensures that the newly changed job options take effect.
>  
> Some community developers are considering extending the JRS so that the 
> stored job result contains the list of retained checkpoints. This of course 
> implies that the cluster is shut down / the job is terminated properly (other 
> cases should be used for fail-over scenarios only).
>  
> As soon as there is a straightforward way of accessing the last checkpoint, 
> we should improve the current implementation.
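
For illustration, the last-state upgrade sequence described in the issue can be sketched with an in-memory model. This is only a rough sketch, not the operator's actual code: `FakeKubernetes`, `last_state_upgrade`, the single-ConfigMap layout, and the `jobGraph-` / `checkpointID-` key prefixes are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical key prefixes inside the HA ConfigMap (assumed for this sketch).
JOB_GRAPH_PREFIX = "jobGraph-"
CHECKPOINT_PREFIX = "checkpointID-"


@dataclass
class FakeKubernetes:
    """In-memory stand-in for a cluster: deployments and HA ConfigMaps by cluster-id."""
    deployments: dict = field(default_factory=dict)
    ha_config_maps: dict = field(default_factory=dict)


def last_state_upgrade(k8s, cluster_id, new_spec):
    """Apply the three steps from the issue and return the retained checkpoint keys."""
    # Step 1: delete the Flink cluster directly, keeping the HA ConfigMap.
    k8s.deployments.pop(cluster_id, None)
    ha_data = k8s.ha_config_maps[cluster_id]

    # Step 2: delete the stored JobGraph so that changed job options take effect.
    for key in [k for k in ha_data if k.startswith(JOB_GRAPH_PREFIX)]:
        del ha_data[key]

    # Step 3: deploy the new application with the same cluster-id, so it can
    # find the retained checkpoint entries and recover from the latest one.
    k8s.deployments[cluster_id] = new_spec
    return sorted(k for k in ha_data if k.startswith(CHECKPOINT_PREFIX))


k8s = FakeKubernetes(
    deployments={"my-job": {"parallelism": 2}},
    ha_config_maps={
        "my-job": {
            "jobGraph-abc": "<pointer to serialized JobGraph>",
            "checkpointID-42": "<pointer to checkpoint metadata>",
        }
    },
)
retained = last_state_upgrade(k8s, "my-job", {"parallelism": 4})
```

After the call, `retained` holds only the checkpoint key, the JobGraph entry is gone, and the deployment carries the new spec. Note that the checkpoint location is never read directly by the caller here, which is the gap the discussion above is about.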



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
