Heya, lengthy text ahead, bear with me :)

# Proposal: Memory Aware Scheduling

## Problem

Today, an invoker has a fixed number of slots, determined by the number of CPUs * the number of shares per CPU. This does not reflect how a serverless platform actually "sells" its resources: the user defines the amount of **memory** she wants to use. Instead of relying on CPU shares, I propose to schedule based on memory consumption.

An example: let's assume an invoker has 16 slots throughout the proposal. Today, OpenWhisk would schedule 16 * 512MB containers to one machine, even though it might not have sufficient memory to actually host those containers. The operator could adjust the core share to fit the worst-case number of MAX_MEMORY containers, but that's not feasible either, since it means unused memory whenever most users consume less than the maximum.

After this proposal, OpenWhisk will schedule based on the memory available on the invoker machine, let's assume 4GB throughout the proposal. If all users use 512MB containers, fine, there will be at most 8 of those containers on one machine. If all users use 128MB containers, there will be up to 32 containers on one machine.

## Benefits

* Scheduling is tied to the resource the user can actually choose.
* Allows a broader range of memory options (MAX_MEMORY could in theory equal the memory available on the biggest machine) without risking over-/undercommits.

## Risks

* Fragmentation. It's certainly an issue the more options we provide. I'm wondering how much worse this gets compared to today though, given we already "waste" 50% of the machine if users choose 128MB containers (with the settings described above). Fragmentation is "configurable" by making the list of choices available to the user larger or smaller.

## Implementation

The following changes are needed to implement memory based scheduling throughout the system.

### 1. Loadbalancer

Today, the loadbalancer keeps track of busy/idle slots on each invoker. In our example, there's a Semaphore with 16 slots which are given away during scheduling. The notion of slots can be kept, but their count will be calculated and given away differently. Assuming our 4GB invokers (for simplicity I propose homogeneous invokers as a first step), each invoker gets `AVAILABLE_MEMORY / MIN_MEMORY` slots, since that is the maximum number of containers that could be scheduled to it in parallel. When an action is scheduled, it needs to acquire `NEEDED_MEMORY / MIN_MEMORY` slots on an invoker. In the following, the number of slots assigned to a request is called its "weight".

The loadbalancer should furthermore emit a metric for the maximum capacity of the system (`NUM_INVOKERS * AVAILABLE_MEMORY`) so the operator can tell how much is left on the system. It would also be useful to have this metric per invoker, to be able to tell how much fragmentation there is in the system.
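To make the slot arithmetic concrete, here is a minimal Scala sketch of how weighted slots could be handed out per invoker. The names (`MemorySlots`, `InvokerSlots`, `weight`) are made up for illustration and don't refer to existing classes:

```scala
import java.util.concurrent.Semaphore

object MemorySlots {
  // Example values from this proposal.
  val minMemoryMB = 128        // MIN_MEMORY: smallest choosable container size
  val availableMemoryMB = 4096 // AVAILABLE_MEMORY: memory per invoker

  /** Slots an invoker exposes: the maximum number of containers it could
    * run in parallel, i.e. AVAILABLE_MEMORY / MIN_MEMORY (32 here). */
  val slotsPerInvoker: Int = availableMemoryMB / minMemoryMB

  /** The "weight" of a request: NEEDED_MEMORY / MIN_MEMORY. */
  def weight(neededMemoryMB: Int): Int = neededMemoryMB / minMemoryMB
}

/** Per-invoker bookkeeping: reserve and release the weight of a request. */
class InvokerSlots(totalSlots: Int) {
  private val semaphore = new Semaphore(totalSlots)

  def tryAcquire(weight: Int): Boolean = semaphore.tryAcquire(weight)
  def release(weight: Int): Unit = semaphore.release(weight)
  def idleSlots: Int = semaphore.availablePermits()
}

object Example extends App {
  val invoker = new InvokerSlots(MemorySlots.slotsPerInvoker)
  val w = MemorySlots.weight(512) // a 512MB action occupies 4 slots
  if (invoker.tryAcquire(w)) {
    // schedule the activation to this invoker ...
    invoker.release(w) // ... and free the slots once it completes
  }
}
```

Summing `idleSlots * MIN_MEMORY` over all invokers would also yield the system-wide and per-invoker capacity metrics mentioned above.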
### 2. Invoker

In the invoker, the `MessageFeed` logic will need adjustment. Like the loadbalancer, it could reduce its capacity based on the weight of each request rather than reducing it by a fixed value of 1. Capacity starts at the maximum number of containers that can run in parallel (`AVAILABLE_MEMORY / MIN_MEMORY`) and is reduced by the number of slots taken by each request.

This cannot be implemented in an exact way though, since the consumer always pulls N records (and the records themselves don't carry a weight). For example, an invocation which needs 512MB in our case gets 4 slots. When that capacity is released, the feed needs to pull 4 new messages, since it doesn't know the weight of each message beforehand. If the next message has a weight of 4 though, 3 messages will stay in the buffer and have to wait for more capacity to be freed. In case of a crash, the messages in the buffer will be lost. This data loss is only relevant in an overload + crash scenario, since in any other case there should not be more messages on the invoker topic than the invoker can handle.
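To illustrate the bookkeeping (and the inexactness just described), here is a simplified, standalone Scala sketch. The actual logic lives in the `MessageFeed` actor; this model and its names are illustrative only:

```scala
import scala.collection.mutable

/** Toy model of a weight-aware feed. `maxSlots` corresponds to
  * AVAILABLE_MEMORY / MIN_MEMORY. */
class WeightedFeed(maxSlots: Int) {
  private var freeSlots = maxSlots
  private val buffer = mutable.Queue.empty[Int] // weights of pulled but unprocessed messages

  /** New records arrive from the queue; their weights are only known now. */
  def onRecords(weights: Seq[Int]): Unit = {
    buffer ++= weights
    dispatch()
  }

  /** Hand out buffered messages as long as their weight fits the free capacity. */
  private def dispatch(): Unit =
    while (buffer.nonEmpty && buffer.head <= freeSlots) {
      freeSlots -= buffer.dequeue()
      // ... start processing the activation here ...
    }

  /** An activation of the given weight finished: free its slots and pull again.
    * Up to `weight` newly pulled messages may not fit and will sit in the
    * buffer, which is the data-loss window in an overload + crash scenario. */
  def release(weight: Int): Unit = {
    freeSlots += weight
    dispatch()
  }

  def buffered: Int = buffer.size
}
```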
### 3. Configuration

The operator defines a list of possible memory values, and the system determines at startup whether that list is usable for scheduling (as in: all values are divisible by the smallest value). To configure the available memory per invoker, I propose a fixed value (4GB in our example) as a first step, passed to both controller and invoker. Future work could make this dynamic via the ping mechanism.
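A possible shape of that startup check, assuming the choices are plain MB values (again, names are illustrative):

```scala
object MemoryConfig {
  /** The list is usable for scheduling if every value is divisible by the
    * smallest one, so every choice maps to a whole number of slots.
    * Returns the smallest value (MIN_MEMORY) on success. */
  def validate(choicesMB: Seq[Int]): Either[String, Int] = {
    val min = choicesMB.min
    if (choicesMB.forall(_ % min == 0)) Right(min)
    else Left(s"All memory choices must be divisible by the smallest value ($min MB)")
  }
}

// Example: MemoryConfig.validate(Seq(128, 256, 512)) == Right(128),
// so a 4GB invoker gets 4096 / 128 = 32 slots.
```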
Fingers bleeding, I'll stop here :)

Looking forward to your input!

Cheers,
Markus