pyttel commented on PR #10: URL: https://github.com/apache/ozone-helm-charts/pull/10#issuecomment-2519958853
# OM HA ideas

The main feature for a Helm-managed Ozone Manager in HA mode is based on `ReplicaCount` changes from one revision to another. To implement this, I used the three Helm and Kubernetes elements listed below.

- **A changed argument wrapper for the start command.** I did not change the command itself, because a lot of magic happens there (environment-to-conf translation and so on). As is usual in Helm commands, I changed the command to handle the startup logic for OM instances. The main idea is to compare the currently used replica count with the replica count configured in the next Helm chart revision, by utilizing the `lookup` function from Helm 3. If the OM replica count difference is zero, there is nothing special to do. If it is negative, which means the number of OM instances will be reduced, decommissioning of these nodes must be triggered (not relevant at this point). If the difference is positive, bootstrapping of new OM instances must happen. As bootstrap for an instance (= node) should only be done once, a file is written to the Helm persistence path from `values.yaml` on successful bootstrap. On every start the file's existence is checked, and only if the file is missing is the bootstrap argument added to the `args` configured in `values.yaml`. Currently, the success of a bootstrap is detected by another background process based on the log output. **This must be changed before a production release, as it depends on a log level of at least INFO.**

  The important part of bootstrapping in the Kubernetes context is when more than one container is added in one Helm revision (delta > 1). The bootstrapping process checks whether all configs are updated with the new nodes. Because of ordered container creation, the container instances with an index bigger than the currently bootstrapping one are not created yet, and thus are not resolvable over DNS at that point in time.
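  The marker-file logic around the start command could be sketched roughly like this. This is a minimal sketch, not the chart's actual code: the `REPLICA_DELTA` value (assumed to be rendered into the pod spec at template time via `lookup`), the marker-file name, the pod name, and the persistence path are all illustrative assumptions.

  ```shell
  # Example values standing in for what the chart would inject:
  REPLICA_DELTA=2                      # assumed to be computed via Helm's `lookup`
  POD_NAME="ozone-om-4"                # StatefulSet pod name ($HOSTNAME in the pod)
  PERSISTENCE_PATH="$(mktemp -d)"      # stand-in for the persistence path from values.yaml
  MARKER="${PERSISTENCE_PATH}/.bootstrapped"

  EXTRA_ARGS=""
  if [ "${REPLICA_DELTA}" -gt 0 ] && [ ! -f "${MARKER}" ]; then
    EXTRA_ARGS="--bootstrap"           # first start of a newly added instance
  fi

  ORDINAL="${POD_NAME##*-}"            # e.g. ozone-om-4 -> 4
  echo "OM ordinal ${ORDINAL} starting with extra args: ${EXTRA_ARGS}"
  # exec ozone om ${EXTRA_ARGS} ...    # real start command elided

  # After a successful bootstrap the wrapper would record the marker so
  # the argument is never added again on restarts:
  touch "${MARKER}"
  ```

  On a later restart the marker file exists, so the instance starts without `--bootstrap`, matching the "bootstrap only once" rule described above.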
  After a while, the bootstrapping process is killed by a Kubernetes timeout and the current container never becomes ready, so further instances are not created. We get stuck in a bootstrapping deadlock here. I tried to resolve this by setting the parallel pod management policy on the StatefulSet, so that all pods are created at the same time, but this caused other problems when two instances bootstrapped simultaneously. As it did not feel right, I went back to the normal ordered container creation of Kubernetes. With ordered container creation, the bootstrapping config for a new container must not contain node IDs of container instances that will only be created after the current one, so bootstrapping must be done one by one with such a config. This is why, while bootstrapping an instance, I overwrite the node IDs to contain only the instances already available in Kubernetes at that point in time plus the currently bootstrapping one. This works like a charm and we stick to the normal Kubernetes behavior.

- **A Helm post-upgrade job.** This is needed for decommissioning. If the delta described above is negative, this job is created as a Helm post-upgrade job. At that point all config reloads of the existing instances have been triggered by Kubernetes because of the checksum annotation I changed a bit. For each decommissioned node, a dedicated decommissioning job is created, along with a temporary service to reach it. The PVC of each decommissioned node is mounted into its job, so after decreasing the instance count the old instances' data is still available to the decommissioning job. Each job executes the decommissioning command. If this succeeds, all data on the PVC is cleared to enable bootstrapping again in the future. Please check `_helper.tpl` for the different environment setups in these phases!

  One issue I faced while testing concerns the leader: if a leader becomes unavailable and should be decommissioned, there were some problems.
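  The per-node decommissioning job flow, "decommission, then wipe the PVC only on success", could be sketched as below. The exact `ozone admin` sub-command and its flags are not shown because they are version-specific; `DECOMMISSION_CMD`, the function name, and the paths are assumptions standing in for the real invocation.

  ```shell
  # Hypothetical flow of one decommissioning job; only the structure
  # (decommission first, wipe the mounted PVC only on success) mirrors
  # the chart logic described above.
  decommission_and_wipe() {
    node_id="$1"   # e.g. "om4"
    data_dir="$2"  # mount point of this node's PVC inside the job

    # "$DECOMMISSION_CMD" stands in for the real `ozone admin ...`
    # invocation against the temporary per-node service.
    if ${DECOMMISSION_CMD:-false} "${node_id}"; then
      # Only on success: clear the PVC so the node can bootstrap again later.
      rm -rf "${data_dir:?}"/*
      return 0
    fi
    echo "decommissioning of ${node_id} failed; keeping data" >&2
    return 1
  }
  ```

  Keeping the data on failure is deliberate: a failed decommission leaves the old instance's state intact for a retry, while a successful one resets the PVC for a future bootstrap.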
  I decided to introduce a pre-upgrade job to handle the leader transfer to the instance with index 0.

- **A Helm pre-upgrade job to transfer the leader.** This is an additional temporary container with a special config (please see `_helper.tpl`; it took me a while to determine the correct config). It runs before any container is updated and transfers the leader to the instance with index 0. This solution seems to be fail-safe and dynamic, and works without the `--force` argument.

I hope this is a good starting point for discussions about solving this challenging task.

## My ordered test cases

1. Create an Ozone deployment in persistent mode with OM HA and 3 instances.
2. Update the Helm chart revision with OM replica count 4 (one bootstrap node).
3. Update the Helm chart revision with OM replica count 7 (multiple bootstrap nodes).
4. Update the Helm chart revision with OM replica count 7 but a changed config (same replica count).
5. Transfer the OM leader to the last instance with ID 6 (check that the leader is automatically transferred back to the instance with ID 0).
6. Update the Helm chart revision with OM replica count 6 (decommission one node).
7. Update the Helm chart revision with OM replica count 3 (decommission multiple nodes).
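The pre-upgrade leader transfer described above could be sketched as a small retry loop like this. The real `ozone admin om transfer` flags are not reproduced here since I have not verified them for a specific Ozone version; `TRANSFER_CMD`, the node ID `om0`, and the retry limits are illustrative assumptions only.

```shell
# Sketch of a pre-upgrade leader-transfer job. The actual transfer
# command and its flags are assumptions; only the "retry until the
# leader sits on instance 0, without --force" behavior is the point.
TARGET_NODE_ID="${TARGET_NODE_ID:-om0}"   # instance with index 0

transfer_leader() {
  # "$TRANSFER_CMD" stands in for the real `ozone admin om transfer ...`
  # invocation targeting the desired new leader.
  ${TRANSFER_CMD:-true} "${TARGET_NODE_ID}"
}

# Retry a few times, since the current leader may be briefly unavailable.
i=0
until transfer_leader; do
  i=$((i + 1))
  [ "$i" -ge 5 ] && { echo "leader transfer failed" >&2; exit 1; }
  sleep 2
done
echo "leader is now ${TARGET_NODE_ID}"
```

Running this as a pre-upgrade hook means every later step of the upgrade (config reloads, decommission jobs) can assume the leader is on the stable instance with index 0.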
