antonipp opened a new pull request, #514: URL: https://github.com/apache/flink-kubernetes-operator/pull/514
> ⚠️ This PR is WIP, my goal is to get feedback as early as possible

## What is the purpose of the change

https://issues.apache.org/jira/browse/FLINK-27273

This PR makes sure that Flink applications which use Zookeeper-based HA can be deployed via the Kubernetes Operator. This was previously not possible, mainly because Zookeeper data was not properly cleaned up, which could cause unexpected behaviour in Flink applications, as described in the ticket.

## Brief change log

- JobGraph-related data is now properly cleaned up in Zookeeper (matching the behaviour of Kubernetes HA)
- The code has been updated to support both Kubernetes and Zookeeper-based HA settings. All features that worked with Kubernetes HA (such as JM deployment recovery, unhealthy-job restarts, and rollbacks) should also work with Zookeeper HA.

## Verifying this change

- **TODO: add unit tests!**
- Tried the following scenarios in our Kubernetes environment:
  - `upgradeMode: savepoint`:
    - [x] Successfully deployed a Flink application with `upgradeMode: savepoint` and Zookeeper HA enabled
    - [x] Updated the `FlinkDeployment` object, verified that the JobGraph data was successfully deleted in Zookeeper and that the new version of the application was successfully rolled out
    - [x] Manually deleted the JobManager Kubernetes Deployment, verified that the Operator was able to recover it (with `kubernetes.operator.jm-deployment-recovery.enabled` and `kubernetes.operator.job.upgrade.last-state-fallback.enabled` set to `true`) and that the application restarted successfully (this process relies on HA metadata too)
  - `upgradeMode: last-state`:
    - [x] Successfully deployed a Flink application with `upgradeMode: last-state` and Zookeeper HA enabled
    - [x] Updated the `FlinkDeployment` object, verified that the JobGraph data was successfully deleted in Zookeeper and that the new version of the application was successfully rolled out
    - [x] Manually deleted the JobManager Kubernetes Deployment, verified that the Operator was able to recover it with `kubernetes.operator.jm-deployment-recovery.enabled` set to `true` and that the application restarted successfully (this process relies on HA metadata too)
  - [x] Set `kubernetes.operator.cluster.health-check.enabled` to `true` and deployed a Flink application which was continuously failing and restarting. Once the restart threshold (`kubernetes.operator.cluster.health-check.restarts.threshold`) was hit, the Operator was able to successfully validate that the Zookeeper HA metadata exists and restarted the job
  - [x] Set `kubernetes.operator.deployment.rollback.enabled` to `true` and deployed a Flink application which was in `CrashLoopBackOff`. Once the `kubernetes.operator.deployment.readiness.timeout` passed, the Operator was able to successfully validate that the Zookeeper HA metadata exists and rolled back the job to the previous version.
- **TODO: anything else?**

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., does it change the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes

## Documentation

- Does this pull request introduce a new feature?: yes
- If yes, how is the feature documented?: Updated the documentation where applicable

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
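As context for the JobGraph clean-up described above: with Flink's default Zookeeper HA configuration, JobGraph metadata lives under a `jobgraphs` subtree of the cluster's HA root (`high-availability.zookeeper.path.root`, default `/flink`, followed by the cluster id). A minimal sketch of that path layout — the class and method names here are hypothetical illustrations, not part of the operator code:

```java
// Hypothetical helper illustrating Flink's default Zookeeper HA path layout:
//   <high-availability.zookeeper.path.root>/<cluster-id>/jobgraphs
// Clean-up for an upgrade would delete this subtree (e.g. with Curator's
// delete().deletingChildrenIfNeeded().forPath(...)) so a redeployed job
// does not recover a stale JobGraph, while leaving other HA data intact.
public final class ZkHaJobGraphPaths {

    /** Builds the znode path holding JobGraph metadata for one cluster. */
    public static String jobGraphPath(String haRoot, String clusterId) {
        // haRoot defaults to "/flink"; clusterId identifies the deployment.
        return haRoot + "/" + clusterId + "/jobgraphs";
    }

    public static void main(String[] args) {
        // Example: prints /flink/my-deployment/jobgraphs
        System.out.println(jobGraphPath("/flink", "my-deployment"));
    }
}
```

Deleting only this subtree (rather than the whole cluster root) is what allows recovery features such as `last-state` upgrades to keep using the remaining HA metadata.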
