style95 commented on PR #5338: URL: https://github.com/apache/openwhisk/pull/5338#issuecomment-1289868836
> I'm curious on the test. Do you think that the additional latency could just be waiting for the kafka consumer / producer to start up? If so that should be rather easy to fix to just wait for the message feed to acknowledge it's finished initializing before marking the service healthy to accept traffic.

It would be worth analyzing further, though each component already has its own warm-up procedure:

- [PoolBalancer](https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/FPCPoolBalancer.scala#L512)
- [ContainerManager](https://github.com/apache/openwhisk/blob/master/core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala#L303)
- [InvokerReactive](https://github.com/apache/openwhisk/blob/master/core/invoker/src/main/scala/org/apache/openwhisk/core/invoker/FPCInvokerReactive.scala#L407)

When a new scheduler endpoint is inserted into etcd, controllers and invokers warm up the path to it. When a new invoker endpoint is inserted, schedulers warm up the container-creation path. Finally, [controllers invoke a warm-up activation](https://github.com/apache/openwhisk/pull/5338/files#diff-379956107a78f18274d3895bbbb268b964cc5f89ac3601333c4255cc0bb7632dR450) at deployment time.

It would be great to double-check the above functionality, but I hope that can be handled in a subsequent PR, as this one is getting big.

> Also would be helpful knowing the exact series of restart events with the graph you shared. Are one of each component all being restarted at once? Would we see the same increased latency if we did the controller restart, the scheduler restart, the invoker restart all isolated from one another?

Let me share the procedures.

### (Prerequisite) Change the serial value for each component to 1

```bash
cat > ${OPENWHISK_HOME}/ansible/invoker.yml << EOL
---
# This playbook deploys Openwhisk Invokers.
- hosts: invokers
  vars:
    host_group: "{{ groups['invokers'] }}"
    name_prefix: "invoker"
    invoker_index_base: 0
  serial: '${INVOKER_SERIAL}'  ### this should be 1
  roles:
    - invoker
EOL
```

With `serial: 1`, each component's nodes are deployed one by one.

### Deploy OpenWhisk in multiple steps

I deployed OW in multiple steps, and the deployment order differs slightly between steps.

```bash
if [ "$STEP" == "1" ]; then
  ## step 1
  $ANSIBLE_CMD invoker.yml --limit 'invokers[0:3]'        # 4 normal invokers
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[0:0]'    # 1 scheduler
  $ANSIBLE_CMD controller.yml --limit 'controllers[0:0]'  # 1 controller
fi

if [ "$STEP" == "2" ]; then
  ## step 2
  $ANSIBLE_CMD controller.yml --limit 'controllers[1:1]'  # next controller
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[1:1]'    # next scheduler
  $ANSIBLE_CMD invoker.yml --limit 'invokers[4:6]'        # 3 invokers
fi

if [ "$STEP" == "3" ]; then
  ## step 3
  $ANSIBLE_CMD controller.yml --limit 'controllers[2:2]'  # last controller
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[2:2]'    # last scheduler
  $ANSIBLE_CMD invoker.yml --limit 'invokers[7:9]'        # 3 invokers
fi
```

First, on the assumption that all components support graceful shutdown, we disable and redeploy some invokers. Once those invokers are deployed, they register under a different etcd prefix than the old invokers, so no requests are sent to them by the old schedulers. Second, we can disable and redeploy one scheduler. Once that scheduler is deployed, it also registers under a different prefix, so no requests are sent to it by the old controllers. At this point, the newly deployed scheduler and invokers are idle.
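To make the prefix isolation concrete, here is a rough `etcdctl` sketch of the idea. The key names and JSON values below are invented for illustration and do not match OpenWhisk's actual etcd key layout:

```bash
# Illustration only: the real key layout is defined by OpenWhisk's etcd
# client; these prefixes and values are made up.

# Old schedulers discover invokers by watching the old prefix, so they
# keep routing only to the old invokers:
etcdctl get --prefix "whisk/v1/invokers/"

# A newly deployed invoker registers itself under a new prefix:
etcdctl put "whisk/v2/invokers/invoker4" '{"host":"10.0.0.14","port":8080}'

# Only newly deployed schedulers watch the new prefix, so the new invoker
# stays idle until a new scheduler starts sending it work:
etcdctl watch --prefix "whisk/v2/invokers/"
```

The same mechanism applies one level up: new schedulers register under a prefix that old controllers do not watch.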
Third, we deploy one controller. During the deployment, the controller is removed from nginx, so no requests are sent to it. Once it is deployed, it can only see the newly deployed scheduler and invokers, and it is then re-added to nginx (a rough sketch of this step is at the end of this comment). So at this point there are two versions of OW running side by side.

After one step is finished, we can deploy the remaining components in reverse order, since there are already schedulers and invokers available. Since the serial value of each component is 1, we don't strictly need to split the procedure into three steps, but at least two steps are required for a zero-downtime deployment.

In the above diagram, there are three points where TPS briefly dropped to 0. Those are the points where controllers were being deployed. Apart from those points, the other components were being deployed one by one.
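To illustrate the nginx step, here is a minimal sketch. The config path, upstream entry, hostname, and port are assumptions made up for the example; the actual ansible role may manage this differently:

```bash
# Hypothetical paths/hostnames for illustration; not the actual ansible role.

# 1. Mark the controller as "down" in the nginx upstream and reload.
#    `nginx -s reload` keeps serving in-flight requests on the old workers.
sed -i 's/server controller0:10001;/server controller0:10001 down;/' \
  /etc/nginx/conf.d/openwhisk.conf
nginx -s reload

# 2. Redeploy the drained controller.
$ANSIBLE_CMD controller.yml --limit 'controllers[0:0]'

# 3. Re-add it to the upstream once it reports healthy again.
sed -i 's/server controller0:10001 down;/server controller0:10001;/' \
  /etc/nginx/conf.d/openwhisk.conf
nginx -s reload
```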
