[ https://issues.apache.org/jira/browse/FLINK-27572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535876#comment-17535876 ]
Gyula Fora commented on FLINK-27572:
------------------------------------
[~wangyang0918] It is most useful in 1.14, but due to the ResultStore
limitations we have seen, there are cases where it's also required in 1.15 (the
job completes/fails and the jobmanager pod dies before the first subsequent observe).
Yes, I think verifying the existence of the configmaps should be enough.
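For illustration, here is a minimal sketch of such a check using the fabric8 KubernetesClient. The label selector (the cluster-id "app" label plus "configmap-type=high-availability") mirrors the labels the Flink native Kubernetes HA services put on their ConfigMaps, but the helper class and method names are hypothetical, not the operator's actual implementation:
{code:java}
import java.util.Map;

import io.fabric8.kubernetes.client.KubernetesClient;

/** Hypothetical helper that checks whether Flink HA ConfigMaps still exist. */
public final class HaMetadataCheck {

    /**
     * Returns true if at least one HA ConfigMap is present for the given cluster.
     * Assumes the HA ConfigMaps are labeled with "app=<clusterId>" and
     * "configmap-type=high-availability", as done by Flink's Kubernetes HA services.
     */
    public static boolean haMetadataExists(
            KubernetesClient client, String namespace, String clusterId) {
        return !client.configMaps()
                .inNamespace(namespace)
                .withLabels(Map.of(
                        "app", clusterId,
                        "configmap-type", "high-availability"))
                .list()
                .getItems()
                .isEmpty();
    }
}
{code}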
Yes, the manual fix is to find the latest external checkpoint/savepoint
manually, but I think you need to delete the FlinkDeployment resource completely
and recreate it while specifying initialSavepointPath. The savepoint config that
you mentioned is basically ignored the way we use it.
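For anyone who needs that manual recovery today, a rough sketch of the flow is below, written against the fabric8 KubernetesClient and the operator's FlinkDeployment CRD model. The package/class names, namespace, deployment name and savepoint path are assumptions for illustration, not a tested procedure; the same steps can equally be done with kubectl by deleting the CR and re-applying a spec that sets job.initialSavepointPath.
{code:java}
// Illustrative only: resource names, paths, and the CRD model package are assumptions.
import org.apache.flink.kubernetes.operator.crd.FlinkDeployment;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public final class ManualLastStateRecovery {

    public static void main(String[] args) {
        String namespace = "default";
        String name = "my-flink-deployment";                              // hypothetical CR name
        String savepointPath = "s3://my-bucket/savepoints/savepoint-xyz"; // latest external checkpoint/savepoint

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // 1. Read the current FlinkDeployment spec so it can be re-submitted later.
            FlinkDeployment existing = client.resources(FlinkDeployment.class)
                    .inNamespace(namespace)
                    .withName(name)
                    .get();

            // 2. Delete the resource completely (in practice, wait until the
            //    operator has finished cleaning it up before recreating).
            client.resources(FlinkDeployment.class)
                    .inNamespace(namespace)
                    .withName(name)
                    .delete();

            // 3. Recreate it with spec.job.initialSavepointPath pointing at the
            //    manually located checkpoint/savepoint.
            existing.getMetadata().setResourceVersion(null);
            existing.getSpec().getJob().setInitialSavepointPath(savepointPath);
            client.resources(FlinkDeployment.class)
                    .inNamespace(namespace)
                    .createOrReplace(existing);
        }
    }
}
{code}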
> Verify HA Metadata present before performing last-state restore
> ---------------------------------------------------------------
>
> Key: FLINK-27572
> URL: https://issues.apache.org/jira/browse/FLINK-27572
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Priority: Blocker
> Fix For: kubernetes-operator-1.0.0
>
>
> When we restore a job using the last-state logic, we need to verify that the
> HA metadata has not been deleted. If it's not there, we simply need to throw
> an error, because this requires manual user intervention.
> This only applies when the FlinkDeployment is not already in a suspended
> state with recorded last state information.
> The problem can be reproduced easily in 1.14 by triggering a fatal job error
> (turn off the restart-strategy and kill a TM, for example). In these cases the
> HA metadata will be removed, and the next last-state upgrade should throw an
> error instead of restoring from a completely empty state.