style95 commented on issue #5256: URL: https://github.com/apache/openwhisk/issues/5256#issuecomment-1145878594
Let me share the current behavior and my opinion.

The new scheduler was designed on the assumption that latency is the most important thing. Some of our downstream users were sensitive to latency; some were concerned about even a few hundred milliseconds of wait time, and we did not want to accept wait times of a few seconds.

With that in mind, here is how the new scheduler works. First, the scheduler looks up the average duration of the given action. Once the action has been invoked at least once, there are activations and we can figure out the average duration; this is handled by `ElasticSearchDurationChecker`. When a memory queue starts up, it tries to fetch the average duration. Once we have the duration, we can estimate the processing power of one container for the given action. For example, if the duration is 10ms, one container can theoretically handle 100 activations per second, while for an action with a 1s duration it can handle only 1 activation per second. From that we can easily calculate the required number of containers.

If an action has never been invoked, we can't figure out the average duration, so the scheduler just creates one container. If the action finishes quickly, we can then figure out the average duration and things proceed accordingly. On the other hand, if it takes longer than the scheduling interval (100ms) to finish, the duration is at least 100ms, but it could be anywhere from 1s to 10s. Since we have no idea yet, the scheduler adds as many containers as there are stale activations in the queue. This is where staleness comes in.

One more thing to consider is that even for short-running actions, some activations can become stale. We may have correctly calculated the required number of containers, but durations vary, so some messages can become stale while containers are running. That means the existing containers are not enough to handle the existing activations. Say 10 activations arrive every 100 milliseconds but the existing containers can only handle 7 of them in that window; then we need to add more containers. So we calculate the required number of additional containers based on the number of stale activations and the average duration:

```scala
val containerThroughput = StaleThreshold / duration
val num = ceiling(availableMsg.toDouble / containerThroughput)
```

Also, if the calculated `num` is 5 while there are only 3 activations in the queue, we don't need to add 5 containers, as 2 of them would be idle, so we only add 3. And since this situation can repeat (container creation generally takes more than 100ms), we also take the number of in-progress (being created) containers into account:

```scala
val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress
```

This is basically how the new scheduler works.
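To tie the snippets above together, here is a small self-contained sketch of the estimation. The names (`requiredContainers`, `staleThresholdMs`, `avgDurationMs`) are mine for illustration only; this is not the actual scheduler code.

```scala
// Simplified illustration of the container estimation described above.
object ContainerEstimation {
  def requiredContainers(staleThresholdMs: Double,
                         avgDurationMs: Double,
                         availableMsg: Int,
                         inProgress: Int): Int = {
    // How many activations one container can finish within one stale threshold,
    // e.g. 100ms / 10ms = 10 activations, 100ms / 10s = 0.01 activations.
    val containerThroughput = staleThresholdMs / avgDurationMs
    val num = math.ceil(availableMsg / containerThroughput).toInt
    // Never create more containers than there are waiting activations,
    // and account for containers that are already being created.
    val capped = math.min(num, availableMsg)
    math.max(capped - inProgress, 0)
  }
}

// Short-running action: 10ms duration, 10 waiting activations -> 1 container.
ContainerEstimation.requiredContainers(100, 10, 10, 0)    // = 1
// Long-running action: 10s duration, 10 waiting activations -> 10 containers.
ContainerEstimation.requiredContainers(100, 10000, 10, 0) // = 10
```

The cap at the number of waiting activations and the subtraction of in-progress containers are what keep the repeated 100ms scheduling rounds from over-provisioning while earlier creations are still pending.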
Now the issue is that for a long-running action, say one with a 10s duration, the per-container throughput is only 0.01 (100ms / 10s), so the scheduler will try to create as many containers as there are activations. When 10 activations come, it will try to create 10 containers to handle them; when 100 activations come, it will create 100 containers (though at the very beginning it only adds one initial container). Since this could end up consuming all resources, we have to throttle it with the namespace limit. If the namespace limit is 30, then only 30 containers are created and 70 activations wait in the queue. Only after about 40 seconds, i.e. ceil(100 / 30) = 4 rounds of 10s each, are all activations handled.

This is just an example, but we thought 40s of wait time was too much, and we wanted to minimize wait time no matter what kind of action is running. On the other hand, this approach can create a huge number of containers within a short period of time, which could overload the Docker engine or a K8S API server. Also, if one action spawns a huge number of containers, it would affect other actions too, as the Docker engine would be busy creating them.

Regarding the idea of increasing the staleness threshold, I am not sure. Some users may still want short wait times even if their actions are long-running. Maybe we can introduce another throttle for container creation, and it should consider fairness among actions. Also, on the invoker side, containers should be created in batches with a limit on the number of containers per batch. (The Docker client already batches like this, but the K8S client doesn't.) And it would be great if OW operators could control the aggressiveness of the scheduler (whether to create containers more or less aggressively).
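To make the batching idea a bit more concrete, here is a minimal sketch of per-batch limited container creation. It is purely illustrative; the names (`createInBatches`, `createContainer`, `batchLimit`) and the shape of the API are assumptions, not existing OpenWhisk or client code.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Create containers in bounded batches so a burst of requests cannot
// flood the Docker engine or the K8S API server.
def createInBatches(requests: List[String],
                    batchLimit: Int,
                    createContainer: String => Future[Unit])(
    implicit ec: ExecutionContext): Future[Unit] = {
  requests
    .grouped(batchLimit) // split the burst into batches of at most batchLimit
    .foldLeft(Future.successful(())) { (prev, batch) =>
      // start the next batch only after the previous one has completed
      prev.flatMap(_ => Future.traverse(batch)(createContainer)).map(_ => ())
    }
}
```

The point is simply that a spike of creation requests from one action gets serialized into bounded batches, so it cannot monopolize the container runtime while other actions are waiting.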
