antonipp opened a new pull request, #514:
URL: https://github.com/apache/flink-kubernetes-operator/pull/514

   > ⚠️ This PR is WIP, my goal is to get feedback as early as possible
   
   ## What is the purpose of the change
   https://issues.apache.org/jira/browse/FLINK-27273
   This PR makes it possible to deploy Flink applications that use Zookeeper-based HA via the Kubernetes Operator.
   This was previously not supported, mainly because Zookeeper HA data was not properly cleaned up, which could cause unexpected behaviour in Flink applications, as mentioned in the ticket.
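   
   As a rough sketch of the use case (not part of this PR; names, image tags, quorum addresses and paths are illustrative), a `FlinkDeployment` enabling Zookeeper-based HA might look like:
   
   ```yaml
   apiVersion: flink.apache.org/v1beta1
   kind: FlinkDeployment
   metadata:
     name: zk-ha-example            # illustrative name
   spec:
     image: flink:1.16              # illustrative version
     flinkVersion: v1_16
     flinkConfiguration:
       # Standard Flink Zookeeper HA settings
       high-availability: zookeeper
       high-availability.zookeeper.quorum: "zk-0:2181,zk-1:2181,zk-2:2181"  # illustrative quorum
       high-availability.storageDir: "s3://my-bucket/flink-ha"              # illustrative path
     jobManager:
       resource:
         memory: "2048m"
         cpu: 1
     taskManager:
       resource:
         memory: "2048m"
         cpu: 1
     job:
       jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
       upgradeMode: savepoint
   ```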
   
   ## Brief change log
   - JobGraph-related data is now properly cleaned up in Zookeeper (same behaviour as with Kubernetes HA)
   - Code has been updated to support both Kubernetes and Zookeeper-based HA settings. All features that worked with Kubernetes HA (such as JM deployment recovery, unhealthy job restarts or rollbacks) should work with Zookeeper HA as well.
   
   ## Verifying this change
   - **TODO: add unit tests!**
   - Tried some scenarios in our Kubernetes environment:
     - `upgradeMode: savepoint`:
       - [x] Successfully deployed a Flink application with `upgradeMode: 
savepoint` and Zookeeper HA enabled
       - [x] Updated the `FlinkDeployment` object, verified that the JobGraph 
data was successfully deleted in Zookeeper and that the new version of the 
application was successfully rolled out
       - [x] Manually deleted the JobManager Kubernetes Deployment, verified 
that the Operator was able to recover it (with 
`kubernetes.operator.jm-deployment-recovery.enabled` and 
`kubernetes.operator.job.upgrade.last-state-fallback.enabled` set to `true`) 
and that the application restarted successfully (this process relies on HA 
metadata too)
     - `upgradeMode: last-state`:
       - [x] Successfully deployed a Flink application with `upgradeMode: 
last-state` and Zookeeper HA enabled
       - [x] Updated the `FlinkDeployment` object, verified that the JobGraph 
data was successfully deleted in Zookeeper and that the new version of the 
application was successfully rolled out
       - [x] Manually deleted the JobManager Kubernetes Deployment, verified 
that the Operator was able to recover it with 
`kubernetes.operator.jm-deployment-recovery.enabled` set to `true` and that the 
application restarted successfully (this process relies on HA metadata too)
     - [x] Set `kubernetes.operator.cluster.health-check.enabled` to `true` and 
deployed a Flink application which was continuously failing and restarting. 
Once the restart threshold 
(`kubernetes.operator.cluster.health-check.restarts.threshold`) was hit, the 
Operator was able to successfully validate that the Zookeeper HA metadata 
exists and restarted the job  
     - [x] Set `kubernetes.operator.deployment.rollback.enabled` to `true` and 
deployed a Flink application which was in `CrashLoopBackOff`. Once the 
`kubernetes.operator.deployment.readiness.timeout` passed, the Operator was 
able to successfully validate that the Zookeeper HA metadata exists and rolled 
back the job to the previous version. 
     - **TODO: anything else?**
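   
   For reference, the operator options exercised in the scenarios above can be collected in the operator configuration; the keys are the options named above, while the values shown here are only the ones used for these manual tests:
   
   ```yaml
   kubernetes.operator.jm-deployment-recovery.enabled: true
   kubernetes.operator.job.upgrade.last-state-fallback.enabled: true
   kubernetes.operator.cluster.health-check.enabled: true
   kubernetes.operator.deployment.rollback.enabled: true
   ```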
   
   ## Does this pull request potentially affect one of the following parts:
     - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., are there any changes to the `CustomResourceDescriptors`?: no
     - Core observer or reconciler logic that is regularly executed: yes 
   
   ## Documentation
     - Does this pull request introduce a new feature?: yes
     - If yes, how is the feature documented?: Updated the documentation where 
applicable

