[
https://issues.apache.org/jira/browse/FLINK-27675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-27675:
-----------------------------------
Labels: pull-request-available (was: )
> Improve manual savepoint tracking
> ---------------------------------
>
> Key: FLINK-27675
> URL: https://issues.apache.org/jira/browse/FLINK-27675
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Assignee: Matyas Orhidi
> Priority: Blocker
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.0.0
>
>
> There are 2 problems with the manual savepoint result observing logic that
> can cause the reconciler to not make progress with the deployment
> (recoveries, upgrades, etc.).
> # Whenever the jobmanager deployment is not in READY state or the job itself
> is not RUNNING, the trigger info must be reset and we should not try to query
> it anymore. Flink will not retry the savepoint if the job fails or is
> restarted anyway.
> # If there is a sensible error when fetching the savepoint status (such as:
> There is no savepoint operation with triggerId=xxx for job ) we should simply
> reset the trigger. These errors will never go away on their own and will
> simply cause the deployment to get stuck in observing/waiting for a savepoint
> to complete.
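
The two rules above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `SavepointTrackingSketch`, `TriggerInfo`, `SavepointStatusFetcher`, `Outcome`, and `observe` are all hypothetical names, and the sketch assumes that any exception thrown while fetching the status is unrecoverable. A real implementation would likely need to separate transient errors from permanent ones.

```java
public class SavepointTrackingSketch {

    /** Hypothetical result of one observe pass. */
    enum Outcome { NO_TRIGGER, RESET_NOT_RUNNING, RESET_ON_ERROR, PENDING, COMPLETED }

    /** Hypothetical holder for the pending manual savepoint trigger. */
    static final class TriggerInfo {
        private String triggerId;

        TriggerInfo(String triggerId) {
            this.triggerId = triggerId;
        }

        boolean isPending() {
            return triggerId != null;
        }

        String getTriggerId() {
            return triggerId;
        }

        void reset() {
            triggerId = null;
        }
    }

    /**
     * Hypothetical status lookup: returns true when the savepoint has completed,
     * false while it is still in progress, and throws when the status cannot be fetched.
     */
    interface SavepointStatusFetcher {
        boolean isCompleted(String triggerId) throws Exception;
    }

    static Outcome observe(
            boolean jobManagerReady,
            boolean jobRunning,
            TriggerInfo trigger,
            SavepointStatusFetcher fetcher) {
        if (!trigger.isPending()) {
            return Outcome.NO_TRIGGER;
        }
        // Rule 1: if the deployment is not READY or the job is not RUNNING,
        // drop the trigger and do not query the status at all.
        if (!jobManagerReady || !jobRunning) {
            trigger.reset();
            return Outcome.RESET_NOT_RUNNING;
        }
        try {
            if (fetcher.isCompleted(trigger.getTriggerId())) {
                trigger.reset();
                return Outcome.COMPLETED;
            }
            return Outcome.PENDING;
        } catch (Exception e) {
            // Rule 2 (simplified assumption): treat every fetch error as permanent
            // and reset the trigger so the reconciler is not blocked.
            trigger.reset();
            return Outcome.RESET_ON_ERROR;
        }
    }
}
```

Resetting the trigger in both branches means the next reconcile loop sees no pending savepoint and can proceed with recoveries or upgrades.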