singhsegv opened a new issue, #5510: URL: https://github.com/apache/openwhisk/issues/5510
I am very confused about how OpenWhisk 2.0.0 is meant to be deployed for a scalable benchmarking setup. I need some help from the maintainers to understand what I am missing, since I've now spent a large amount of time on this and still lack some key pieces.

### Context

We are using OpenWhisk for a research project in which workflows (sequential as well as fork/join) are deployed and benchmarked at 1, 4, 8 RPS, etc., for long periods of time. The goal is to compare private-cloud FaaS against public-cloud FaaS.

### Current Infrastructure Setting

We have an in-house cluster of around 10 VMs running on different nodes, with 50 vCPUs and around 200 GB of memory in total.

Since I am new to this, I initially followed https://github.com/apache/openwhisk-deploy-kube to deploy it and, together with OpenWhisk Composer, was able to get the workflows running after a lot of small fixes and changes.

### Problems with Current Infrastructure

1. I am not able to scale it properly. Running even 1 RPS for 5-10 minutes leads to a lot of random errors, such as "failed to get binary", along with others that don't occur when a workflow is run once manually.
2. Even when I reduce the benchmarking time to 10-20 s, the inter-action communication time comes out to around 1.5-2 minutes. Grafana shows around 2 minutes being spent in `/init`, and I am unable to debug why that is happening.

### Main Doubts about Scaling

1. Since openwhisk-deploy-kube uses a very old version of OpenWhisk, I thought running the latest version without k8s on a single machine might bring some benefits. But what I've understood now is that standalone mode is not meant to be scalable, since in v1.0.0 the controller is responsible for a lot of things; I haven't checked whether that is still the case in 2.0.0.
2. Since deploy-kube doesn't support the latest version of OpenWhisk due to the major changes in the scheduler, how is OpenWhisk supposed to be deployed on a scalable infrastructure? Is there some documentation that I've missed?
3. Also, regarding the results I've got with 1.0.0: is there something I am missing? Why aren't the workflows scaling? How should I go about debugging the delay? Or is it purely that more infrastructure needs to be added?

@style95 @dgrove-oss Since you have been active in the community and have answered some of my previous queries too, any help on this will be much appreciated. We are planning to go all in with OpenWhisk for our research and plan to contribute some good changes back to the community related to FaaS at the edge and to improving communication times in FaaS. But since none of us has infrastructure as our strong suit, getting over these initial hiccups is becoming a blocker for us.

So looking forward to some help, thanks :).
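One way to narrow down where the ~2 minutes between actions goes is to pull the activation records of a single workflow run and split each one into queueing, initialization, and run time, plus the idle gap since the previous activation ended. The sketch below assumes the records are JSON dicts with `start`/`end` in epoch milliseconds and an `annotations` list of `{"key", "value"}` pairs (for example, as returned by `GET /api/v1/namespaces/_/activations/{id}`). It reads the documented `waitTime` and `initTime` annotations, where `initTime` only appears on cold starts. The helper names `annotation`, `Breakdown`, and `breakdown` are hypothetical, not part of any OpenWhisk tool.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


def annotation(record: Dict[str, Any], key: str, default: int = 0) -> int:
    """Return an integer annotation from an activation record, or `default` if absent."""
    for ann in record.get("annotations", []):
        if ann.get("key") == key:
            return int(ann.get("value", default))
    return default


@dataclass
class Breakdown:
    name: str
    wait_ms: int        # `waitTime`: roughly the time spent inside OpenWhisk before the action ran
    init_ms: int        # `initTime`: only present when the activation was a cold start
    duration_ms: int    # end - start, as reported in the activation record
    gap_before_ms: int  # time between the previous activation's end and this one's start


def breakdown(records: List[Dict[str, Any]]) -> List[Breakdown]:
    """Order one run's activations by start time and compute a per-activation latency breakdown."""
    ordered = sorted(records, key=lambda r: r["start"])
    rows: List[Breakdown] = []
    prev_end: Optional[int] = None
    for rec in ordered:
        gap = 0 if prev_end is None else rec["start"] - prev_end
        rows.append(Breakdown(
            name=rec.get("name", "?"),
            wait_ms=annotation(rec, "waitTime"),
            init_ms=annotation(rec, "initTime"),
            duration_ms=rec["end"] - rec["start"],
            gap_before_ms=gap,
        ))
        prev_end = rec["end"]
    return rows
```

If most of the time shows up as large `initTime` values on many activations, that points at cold starts (container creation or image pulls) rather than at the communication path itself. Large gaps with small `waitTime` would instead suggest overhead outside the invoked actions.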

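For the fixed-rate runs (1/4/8 RPS over long periods), it helps to make sure the load generator is open-loop. Requests should be sent on a fixed schedule whether or not earlier ones have finished, so slow responses do not silently lower the offered load. The sketch below is a minimal asyncio version under that assumption. `invoke` is a hypothetical coroutine that you would supply, for example one that POSTs to the action or composition endpoint and awaits the result. The function records latencies and error strings instead of aborting on the first failure.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def run_fixed_rate(invoke: Callable[[int], Awaitable[None]],
                         rps: float, duration_s: float) -> Tuple[List[float], List[str]]:
    """Open-loop driver: start request i at t0 + i / rps, each as its own task.

    Returns (latencies in seconds for successful requests, error strings for failed ones).
    """
    loop = asyncio.get_running_loop()
    interval = 1.0 / rps
    total = int(rps * duration_s)
    latencies: List[float] = []
    errors: List[str] = []

    async def one(i: int) -> None:
        t_start = time.perf_counter()
        try:
            await invoke(i)
            latencies.append(time.perf_counter() - t_start)
        except Exception as exc:  # keep the benchmark running; just record the failure
            errors.append(repr(exc))

    t0 = loop.time()
    tasks = []
    for i in range(total):
        # Sleep until this request's scheduled send time, computed from t0 to avoid drift.
        delay = t0 + i * interval - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(one(i)))
    await asyncio.gather(*tasks)
    return latencies, errors
```

Because each request runs as a separate task, a 2-minute response does not delay the requests scheduled after it. That is usually what you want when comparing platforms at a target RPS, and it makes backlogs visible as growing latencies and errors.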