Heya, lengthy text ahead, bear with me :)

# Proposal: Memory Aware Scheduling

## Problem

Today, an invoker has a fixed number of slots, which is determined by the 
number of CPUs * the number of shares per CPU. This does not reflect how a 
serverless platform actually "sells" its resources: The user defines the amount 
of **memory** she wants to use.

Instead of relying on CPU shares, I propose scheduling based on memory 
consumption.

An example:

Let's assume an invoker has 16 slots throughout the proposal. Today, OpenWhisk 
would schedule 16 * 512MB containers to one machine, even though it might not 
have sufficient memory to actually host those containers. The operator could 
adjust the coreshare so that even the worst case (all containers at MAX_MEMORY) 
fits, but that's not feasible either, since it leaves memory unused when most 
users consume less memory.

After this proposal, OpenWhisk will schedule based on the available memory on 
the invoker machine, let's assume 4GB per invoker throughout the proposal. If 
all users use 512MB containers, fine, there will be at most 8 of those 
containers on one machine. On the other hand, if all users use 128MB 
containers, there will be 32 containers on one machine.

## Benefits

* Scheduling is tied to the resource the user can choose
* Allows a broader range of memory options (in theory, MAX_MEMORY could equal 
the memory available on the biggest machine) without risking over- or 
undercommitting

## Risks

* Fragmentation. It's certainly an issue the more options we provide. I'm 
wondering how much worse this will get compared to today though, given we 
already "waste" 50% of the machine if users choose 128MB containers and the 
settings are as described above (16 slots * 128MB = 2GB used of 4GB). 
Fragmentation is "configurable" by growing or shrinking the list of memory 
choices offered to the user.

## Implementation

The following changes are needed to implement memory-based scheduling 
throughout the system.

### 1. Loadbalancer

The loadbalancer keeps track of busy/idle slots on each invoker today. In our 
example, there's a Semaphore with 16 slots which are given away during 
scheduling. The notion of slots can be kept, but their count will be calculated 
and given away differently. Assuming our 4GB invokers (for simplicity I propose 
homogeneous invokers as a first step), each invoker gets 
`AVAILABLE_MEMORY / MIN_MEMORY` slots, the maximum number of containers that 
could run on it in parallel. When an action is scheduled, it needs to acquire 
`NEEDED_MEMORY / MIN_MEMORY` slots on an invoker. In the following, the number 
of slots assigned to a request will be called its "weight".
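
To make the arithmetic concrete, here's a minimal sketch, using made-up names 
(`minMemoryMB`, `availableMemoryMB`, `weight`) and a plain 
`java.util.concurrent.Semaphore` rather than whatever the loadbalancer actually 
uses internally:

```scala
import java.util.concurrent.Semaphore

// Hypothetical sketch of memory-based slot accounting for one invoker.
object MemorySlots {
  val minMemoryMB = 128        // smallest schedulable container size (assumption)
  val availableMemoryMB = 4096 // memory of a (homogeneous) invoker, 4GB in the example

  // Slots per invoker: the maximum number of containers that could run in parallel.
  val slotsPerInvoker: Int = availableMemoryMB / minMemoryMB // 4096 / 128 = 32

  // The "weight" of a request: how many slots it occupies.
  def weight(neededMemoryMB: Int): Int = neededMemoryMB / minMemoryMB

  def main(args: Array[String]): Unit = {
    val invokerSlots = new Semaphore(slotsPerInvoker)

    // Scheduling a 512MB action acquires 4 slots, a 128MB action acquires 1.
    val w = weight(512)
    if (invokerSlots.tryAcquire(w)) {
      println(s"scheduled, ${invokerSlots.availablePermits} slots left")
      // ...once the activation completes, its slots are released again:
      invokerSlots.release(w)
    }
  }
}
```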

The loadbalancer should furthermore emit a metric of the maximum capacity of 
the system (`NUM_INVOKERS * AVAILABLE_MEMORY`) so the operator can tell how 
much is left on the system. It would also be useful to have this metric per 
invoker to be able to tell how much fragmentation there is in the system.
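
As a rough sketch of those two figures, again with made-up names and leaving 
open how the metrics are actually emitted:

```scala
// Hypothetical sketch of the proposed capacity metrics.
// freeSlotsPerInvoker would come from the loadbalancer's bookkeeping.
case class CapacityMetrics(numInvokers: Int,
                           availableMemoryMB: Int,
                           minMemoryMB: Int,
                           freeSlotsPerInvoker: Map[Int, Int]) {

  // Maximum capacity of the whole system: NUM_INVOKERS * AVAILABLE_MEMORY.
  val totalMemoryMB: Long = numInvokers.toLong * availableMemoryMB

  // Free memory per invoker; many small, scattered leftovers indicate fragmentation.
  val freeMemoryPerInvokerMB: Map[Int, Long] =
    freeSlotsPerInvoker.map { case (invoker, slots) => invoker -> slots.toLong * minMemoryMB }
}
```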

### 2. Invoker

In the invoker, the `MessageFeed` logic will need adjustment. Like the 
loadbalancer, it could reduce its capacity based on the weight of each request 
rather than reducing it by the fixed value of 1. Capacity starts at the maximum 
number of containers that could run in parallel 
(`AVAILABLE_MEMORY / MIN_MEMORY`) and is decremented by the number of slots 
taken by each request.

This cannot be implemented in an exact way though, since the consumer always 
pulls N records (and records themselves don't have a weight). For example, an 
invocation which needs 512MB gets 4 slots in our case. Upon releasing those, 
the feed needs to pull 4 new messages, since it doesn't know the weight of each 
message beforehand. If the next message also has a weight of 4 though, 3 
messages will stay in the buffer and have to wait for more capacity to be 
freed. In case of a crash, the messages in that buffer will be lost.

The data loss is only relevant in an overload + crash scenario, since in any 
other case there should not be more messages on the invoker topic than it can 
handle.
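
For illustration, here's a heavily simplified sketch of that bookkeeping. It is 
not the real `MessageFeed` (no Kafka consumer, no actor), just the weight-based 
capacity accounting with a buffer in front of it, using made-up names:

```scala
import scala.collection.mutable

// Hypothetical, heavily simplified sketch of weight-based feed capacity.
class WeightedFeed(availableMemoryMB: Int, minMemoryMB: Int) {
  private var freeSlots = availableMemoryMB / minMemoryMB // e.g. 4096 / 128 = 32
  private val buffer = mutable.Queue.empty[Int]           // weights of pulled but not yet handed-over messages

  // A new message was pulled; its weight only becomes known at this point.
  def offer(weightOfMessage: Int): Unit = {
    buffer.enqueue(weightOfMessage)
    dispatch()
  }

  // A previously handed-over message finished; its slots become free again.
  def release(weight: Int): Unit = {
    freeSlots += weight
    dispatch()
  }

  // Hand over buffered messages as long as the head fits into the free capacity.
  // If the head's weight exceeds the free slots, it (and everything behind it)
  // waits in the buffer until enough capacity is released.
  private def dispatch(): Unit =
    while (buffer.nonEmpty && buffer.head <= freeSlots) {
      freeSlots -= buffer.dequeue()
      // ...the real feed would hand the message to the container pool here.
    }
}
```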

### 3. Configuration

The operator defines a list of possible memory values, and at startup the 
system determines whether that list is usable for scheduling (as in: all values 
are divisible by the smallest value).

To configure the available memory per invoker, I propose a fixed value (4GB in 
our example) first, passed to both controller and invoker. Future work could 
make this dynamic via the ping mechanism.
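
A sketch of what that startup check could look like, with hypothetical config 
names (`memoryChoicesMB`, `availableMemoryMB`), not an actual OpenWhisk config 
class:

```scala
// Hypothetical configuration sketch: the operator-defined memory choices plus
// the fixed per-invoker memory, validated once at startup.
case class MemorySchedulingConfig(memoryChoicesMB: List[Int], availableMemoryMB: Int) {
  require(memoryChoicesMB.nonEmpty, "at least one memory choice must be configured")

  val minMemoryMB: Int = memoryChoicesMB.min
  val maxMemoryMB: Int = memoryChoicesMB.max

  // Usable for scheduling: every choice must be divisible by the smallest one,
  // so that each choice maps onto a whole number of slots (its weight).
  require(
    memoryChoicesMB.forall(_ % minMemoryMB == 0),
    s"all memory choices must be divisible by the smallest choice ($minMemoryMB MB)")

  // The largest choice must still fit onto a single invoker.
  require(
    maxMemoryMB <= availableMemoryMB,
    s"MAX_MEMORY ($maxMemoryMB MB) must not exceed the invoker memory ($availableMemoryMB MB)")
}
```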


Fingers bleeding, I'll stop here :)

Looking forward to your input!

Cheers,
Markus
