Gyula Fora created FLINK-27820:
----------------------------------

             Summary: Handle Upgrade/Deployment errors gracefully
                 Key: FLINK-27820
                 URL: https://issues.apache.org/jira/browse/FLINK-27820
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.0.0
            Reporter: Gyula Fora
            Assignee: Gyula Fora
             Fix For: kubernetes-operator-1.1.0

The operator currently cannot gracefully handle failures that occur during job submission, or directly after it but before the status is updated. This applies to initial cluster submissions, when a Flink CR has just been created, but more importantly to upgrades.

This is slightly related to https://issues.apache.org/jira/browse/FLINK-27804, where mid-upgrade observation was disabled to work around some issues. That logic should also be improved so that it only skips observing last-state info for already finished jobs (those that were observed before).

During upgrades, the observer should be able to recognize when the job/cluster was in fact already submitted, even if the subsequent status update failed, and move the status into a healthy DEPLOYED state.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)