style95 commented on PR #5338: URL: https://github.com/apache/openwhisk/pull/5338#issuecomment-1289868836
> I'm curious on the test. Do you think that the additional latency could just be waiting for the kafka consumer / producer to start up? If so that should be rather easy to fix to just wait for the message feed to acknowledge it's finished initializing before marking the service healthy to accept traffic.

It would be worth analyzing further, though each component already has its own warm-up procedure:

- [PoolBalancer](https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/FPCPoolBalancer.scala#L512)
- [ContainerManager](https://github.com/apache/openwhisk/blob/master/core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala#L303)
- [InvokerReactive](https://github.com/apache/openwhisk/blob/master/core/invoker/src/main/scala/org/apache/openwhisk/core/invoker/FPCInvokerReactive.scala#L407)

When a new scheduler endpoint is inserted into etcd, controllers and invokers warm up the path to it. When a new invoker endpoint is inserted, schedulers warm up the container-creation path. Finally, [controllers invoke a warm-up activation](https://github.com/apache/openwhisk/pull/5338/files#diff-379956107a78f18274d3895bbbb268b964cc5f89ac3601333c4255cc0bb7632dR450) at deployment time.

It would be great to double-check the above functionality, but I hope that can be handled in a subsequent PR, as this one is getting big.

> Also would be helpful knowing the exact series of restart events with the graph you shared. Are one of each component all being restarted at once? Would we see the same increased latency if we did the controller restart, the scheduler restart, the invoker restart all isolated from one another?

Let me share the procedures.

### (Prerequisite) Change the serial value for each component to 1

```bash
cat > ${OPENWHISK_HOME}/ansible/invoker.yml << EOL
---
# This playbook deploys Openwhisk Invokers.
- hosts: invokers
  vars:
    host_group: "{{ groups['invokers'] }}"
    name_prefix: "invoker"
    invoker_index_base: 0
  serial: '${INVOKER_SERIAL}'  ### this should be 1
  roles:
    - invoker
EOL
```

With `serial: 1`, each component's nodes are deployed one by one.

### Deploy OpenWhisk in multiple steps

I deployed OW in multiple steps, and the deployment order differs slightly between steps.

```bash
if [ "$STEP" == "1" ]; then
  ## step 1
  $ANSIBLE_CMD invoker.yml --limit 'invokers[0:3]'        # 4 normal invokers
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[0:0]'    # 1 scheduler
  $ANSIBLE_CMD controller.yml --limit 'controllers[0:0]'  # 1 controller
fi

if [ "$STEP" == "2" ]; then
  ## step 2
  $ANSIBLE_CMD controller.yml --limit 'controllers[1:1]'  # next controller
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[1:1]'    # next scheduler
  $ANSIBLE_CMD invoker.yml --limit 'invokers[4:6]'        # 3 invokers
fi

if [ "$STEP" == "3" ]; then
  ## step 3
  $ANSIBLE_CMD controller.yml --limit 'controllers[2:2]'  # last controller
  $ANSIBLE_CMD scheduler.yml --limit 'schedulers[2:2]'    # last scheduler
  $ANSIBLE_CMD invoker.yml --limit 'invokers[7:9]'        # 3 invokers
fi
```

First, on the assumption that all components support graceful shutdown, we disable and redeploy some invokers. Once those invokers are deployed, they register under a different etcd prefix than the old invokers, so no requests are sent to them by the old schedulers. Second, we can disable and redeploy one scheduler. Once that scheduler is deployed, it also registers under a different prefix, so no requests are sent to it by the old controllers. At this point, the newly deployed scheduler and invokers are idle.
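To make the prefix isolation concrete, here is a rough `etcdctl` sketch of the idea. The key names and JSON values below are invented for illustration and do not match OpenWhisk's actual etcd key layout:

```bash
# Illustration only: the real key layout is defined by OpenWhisk's etcd
# client; these prefixes and values are made up.

# Old schedulers discover invokers by watching the old prefix, so they
# keep routing only to the old invokers:
etcdctl get --prefix "whisk/v1/invokers/"

# A newly deployed invoker registers itself under a new prefix:
etcdctl put "whisk/v2/invokers/invoker4" '{"host":"10.0.0.14","port":8080}'

# Only newly deployed schedulers watch the new prefix, so the new invoker
# stays idle until a new scheduler starts sending it work:
etcdctl watch --prefix "whisk/v2/invokers/"
```

The same mechanism applies one level up: new schedulers register under a prefix that old controllers do not watch.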
Third, we deploy one controller. During the deployment, the controller is removed from nginx, so no requests are sent to it. Once it is deployed, it can only see the newly deployed scheduler and invokers, and it is then re-added to nginx (a rough sketch of this step is at the end of this comment). So at this point there are two versions of OW running side by side.

After one step is finished, we can deploy the remaining components in reverse order, since there are already schedulers and invokers available. Since the serial value of each component is 1, we don't strictly need to split the procedure into three steps, but at least two steps are required for a zero-downtime deployment.

In the above diagram, there are three points where TPS briefly dropped to 0. Those are the points where controllers were being deployed. Apart from those points, the other components were being deployed one by one.
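To illustrate the nginx step, here is a minimal sketch. The config path, upstream entry, hostname, and port are assumptions made up for the example; the actual ansible role may manage this differently:

```bash
# Hypothetical paths/hostnames for illustration; not the actual ansible role.

# 1. Mark the controller as "down" in the nginx upstream and reload.
#    `nginx -s reload` keeps serving in-flight requests on the old workers.
sed -i 's/server controller0:10001;/server controller0:10001 down;/' \
  /etc/nginx/conf.d/openwhisk.conf
nginx -s reload

# 2. Redeploy the drained controller.
$ANSIBLE_CMD controller.yml --limit 'controllers[0:0]'

# 3. Re-add it to the upstream once it reports healthy again.
sed -i 's/server controller0:10001 down;/server controller0:10001;/' \
  /etc/nginx/conf.d/openwhisk.conf
nginx -s reload
```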
