[ https://issues.apache.org/jira/browse/FLINK-27572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535876#comment-17535876 ]

Gyula Fora commented on FLINK-27572:
------------------------------------

[~wangyang0918] It is most useful in 1.14, but due to the ResultStore 
limitations we have seen, there are cases where it's also required in 1.15 (the 
job completes/fails and the jobmanager pod dies before the first subsequent 
observe).

Yes, I think verifying the existence of the configmaps should be enough.
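A minimal sketch of what that check could look like, reduced to pure logic over a listing of ConfigMap names. The naming convention (HA ConfigMaps prefixed with the cluster id and ending in "-config-map") is an assumption based on Flink's native Kubernetes HA; the real operator check would query the Kubernetes API with the deployment's labels instead:

```java
import java.util.List;

public class HaMetadataCheck {

    // Hypothetical helper: decides whether a last-state restore is allowed.
    // The name pattern is an assumption; in practice the operator would list
    // ConfigMaps in the job namespace via the Kubernetes client.
    static boolean hasHaMetadata(String clusterId, List<String> configMapNames) {
        return configMapNames.stream()
                .anyMatch(name -> name.startsWith(clusterId)
                        && name.endsWith("-config-map"));
    }

    public static void main(String[] args) {
        // Simulated ConfigMap listings from the job namespace.
        List<String> present = List.of("my-job-cluster-config-map", "unrelated-cm");
        List<String> missing = List.of("unrelated-cm");

        System.out.println(hasHaMetadata("my-job", present));  // true  -> restore may proceed
        System.out.println(hasHaMetadata("my-job", missing));  // false -> throw, manual intervention needed
    }
}
```

If the check fails, the operator should raise an error rather than deploy from an empty state, matching the behavior requested in the issue.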

Yes, the manual fix is to find the latest external checkpoint/savepoint 
manually, but I think you need to delete the FlinkDeployment resource 
completely and recreate it while specifying initialSavepointPath. The savepoint 
config that you mentioned is basically ignored the way we use it.
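The manual recovery described above would look roughly like the fragment below; the deployment name and savepoint path are placeholders, and the field names follow the operator's FlinkDeployment CRD:

```yaml
# After deleting the broken FlinkDeployment, recreate it pointing the job at
# the last externalized checkpoint/savepoint found manually on durable storage.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-deployment                  # placeholder
spec:
  job:
    # Path located manually; the operator restores the job from here.
    initialSavepointPath: s3://my-bucket/savepoints/savepoint-xxxx   # placeholder
    upgradeMode: savepoint
```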

> Verify HA Metadata present before performing last-state restore
> ---------------------------------------------------------------
>
>                 Key: FLINK-27572
>                 URL: https://issues.apache.org/jira/browse/FLINK-27572
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Blocker
>             Fix For: kubernetes-operator-1.0.0
>
>
> When we restore a job using the last-state logic, we need to verify that the 
> HA metadata has not been deleted. If it is not there, we should simply throw 
> an error, because recovery requires manual user intervention.
> This only applies when the FlinkDeployment is not already in a suspended 
> state with recorded last-state information.
> The problem can be reproduced easily in 1.14 by triggering a fatal job error 
> (turn off the restart strategy and kill a TM, for example). In these cases the 
> HA metadata will be removed, and the next last-state upgrade should throw an 
> error instead of restoring from a completely empty state.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
