pyttel commented on PR #10:
URL: https://github.com/apache/ozone-helm-charts/pull/10#issuecomment-2519958853

   # OM HA ideas
   
   The main feature of the Helm-managed Ozone Manager in HA mode is based on `replicaCount` changes from one revision to
   the next. To implement it, I used the three Helm and Kubernetes elements listed below:
   
   - A changed argument wrapper for the start command. I did not change the command itself, because a lot of magic happens there (environment-to-conf translation and so on). As is common in Helm charts, I wrapped the command to handle the startup logic for the OM instances. The main idea is to compare the currently used replica count with the replica count configured in the next Helm chart revision, using the `lookup` function from Helm 3. If the difference in OM replica count is zero, there is nothing special to do. If it is negative, meaning the number of OM instances will be reduced, decommissioning of those nodes must be triggered (not relevant at this point in the startup logic). If the difference is positive, new OM instances must be bootstrapped. Since bootstrapping an instance (= node) should only happen once, a marker file is written to the Helm persistence path from `values.yaml` on successful bootstrap. On every start the file's existence is checked, and only if the file is missing is the bootstrap argument added to the `args` configured in `values.yaml`. Currently, the success of the bootstrap is detected by another background process based on the log output. **This must be changed before a production release, as it depends on a log level of at least INFO.** The important part of bootstrapping in the Kubernetes context is when more than one container is added in a single Helm revision (delta > 1). The bootstrapping process checks whether all configs have been updated with the new nodes. Because of ordered container creation, the container instances with an index greater than the one currently bootstrapping are not created yet, so they are not resolvable via DNS at that point in time. After a while the bootstrapping process is killed by a Kubernetes timeout, the current container never becomes ready, and no further instances are created: a bootstrapping deadlock. I first tried to resolve this by setting the StatefulSet's pod management policy to parallel, so all pods are created at the same time, but this caused other problems when two instances bootstrapped simultaneously. As that did not feel right, I went back to normal ordered container creation. With ordered creation, the bootstrapping config for a new container must not contain node IDs of container instances that will be created after the current one, so bootstrapping must be done one by one with such a config. This is why, while bootstrapping an instance, I overwrite the node IDs to contain only the instances already available in Kubernetes at that point in time plus the current one. This works like a charm, and we stick to the normal Kubernetes behavior.
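
   The marker-file part of the wrapper can be sketched roughly as follows. This is a minimal illustration only: the marker file name, the `compute_args` helper, and the exact bootstrap flag are assumptions for the sketch, not the chart's actual values.

   ```sh
   #!/bin/sh
   # Sketch of the bootstrap wrapper logic (file name and flag are assumptions).
   # compute_args <persist_dir> prints the args the OM container would start with.
   compute_args() {
     marker="$1/.om-bootstrap-done"   # hypothetical marker file name
     args="om"                        # base start args from values.yaml
     if [ ! -f "$marker" ]; then
       args="$args --bootstrap"       # first start of this instance: bootstrap
     fi
     printf '%s\n' "$args"
   }

   # Simulate two starts: the marker is written only after a successful bootstrap.
   dir=$(mktemp -d)
   first=$(compute_args "$dir")       # no marker yet -> bootstrap requested
   touch "$dir/.om-bootstrap-done"    # written on successful bootstrap
   second=$(compute_args "$dir")      # marker present -> plain start
   echo "first:  $first"
   echo "second: $second"
   ```

   The real chart additionally computes the replica-count delta via `lookup` before deciding whether bootstrapping is needed at all; the sketch only covers the per-instance marker check.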
   - A Helm post-upgrade job. This is needed for decommissioning. If the delta described above is negative, this job is created as a Helm post-upgrade job. At that point all config reloads of the existing instances have already been triggered by Kubernetes because of the checksum annotation, which I changed slightly. For each decommissioned node, a dedicated decommissioning job is created, along with a temporary service to reach it. The PVC of each decommissioned node is mounted into its job, so the old instances' data is still available to the decommissioning job after the instance count has been decreased. Each job executes the decommissioning command; if it succeeds, all data on the PVC is cleared so the node can be bootstrapped again in the future. Please check `_helper.tpl` for the different env setups in these phases! One issue I faced while testing concerns the leader: if the leader becomes unavailable and should be decommissioned, there were problems. I therefore introduced a pre-upgrade job to handle the leader transfer to the instance with index 0.
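
   The per-node job logic described above can be sketched like this. Note that `run_decommission` is a stub standing in for the real Ozone CLI invocation (whose exact syntax lives in the chart), and the function name and mount path are illustrative assumptions:

   ```sh
   #!/bin/sh
   # Sketch of one per-node decommissioning job (names/paths are assumptions).
   # decommission_node <node_id> <pvc_mount> runs the decommission step and,
   # only on success, clears the PVC so the node can be bootstrapped again later.
   decommission_node() {
     node_id="$1"
     pvc_mount="$2"
     if run_decommission "$node_id"; then
       rm -rf "${pvc_mount:?}/"*     # wipe data only after success
       return 0
     fi
     return 1                        # keep data for retry/inspection on failure
   }

   # Stub standing in for the real Ozone decommission command, so the sketch runs.
   run_decommission() { echo "decommissioning $1"; }

   d=$(mktemp -d); touch "$d/old-data"
   decommission_node om-3 "$d"
   ls "$d"                           # empty after a successful decommission
   ```

   Clearing the PVC only after a confirmed success mirrors the behavior described above: a failed decommission leaves the old data in place rather than destroying it.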
   - A Helm pre-upgrade job to transfer the leader. This is an additional temporary container with a special config (see `_helper.tpl`; it took me a while to determine the correct config). It runs before any container is updated and transfers the leader to the instance with index 0.
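
   The "runs before any container is updated" ordering comes from Helm's hook mechanism. A minimal sketch of the hook annotations for such a job could look like the fragment below; the job name, image, and command are placeholders, since the real config is built in `_helper.tpl`:

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: om-leader-transfer              # illustrative name
     annotations:
       "helm.sh/hook": pre-upgrade         # run before any resource is updated
       "helm.sh/hook-delete-policy": hook-succeeded
   spec:
     template:
       spec:
         restartPolicy: Never
         containers:
           - name: leader-transfer
             image: "apache/ozone:{{ .Chart.AppVersion }}"
             # Placeholder: the actual transfer command and env are assembled
             # in _helper.tpl, not hard-coded like this.
             command: ["/bin/sh", "-c", "echo transfer leader to instance 0"]
   ```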
   
   This solution seems to be fail-safe and dynamic, and it works without the `--force` argument. I hope this is a good starting point for discussions about solving this challenging task.
   
   ## My ordered test cases
   
   1. Create an Ozone deployment in persistent mode with OM HA and 3 instances
   2. Update the Helm chart revision with OM replica count 4 (bootstrap one node)
   3. Update the Helm chart revision with OM replica count 7 (bootstrap multiple nodes)
   4. Update the Helm chart revision with OM replica count 7 but a changed config (same replica count)
   5. Transfer the OM leader to the last instance with id 6 (check that the leader is automatically transferred back to the instance with id 0)
   6. Update the Helm chart revision with OM replica count 6 (decommission one node)
   7. Update the Helm chart revision with OM replica count 3 (decommission multiple nodes)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

