Hi Markus.

Great proposal!
So, as I understand your proposal, the operator can now configure all
memory types, and the loadbalancer schedules actions with different weights
on slots based on memory.
The maximum number of slots will be calculated from the available memory
and the minimum memory type.
For example, with 4GB and the memory types 128MB, 256MB and 512MB, there
will be 32 slots (4GB / 128MB), and the loadbalancer will assign 1, 2 and
4 slots to 128MB, 256MB and 512MB actions respectively.
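
The slot arithmetic above can be sketched in Python as follows, assuming
memory values in MB and homogeneous invokers; `slot_count` and `weight` are
hypothetical helper names, not existing OpenWhisk functions:

```python
def slot_count(available_mb: int, min_mb: int) -> int:
    """Maximum number of parallel containers: AVAILABLE_MEMORY / MIN_MEMORY."""
    return available_mb // min_mb


def weight(needed_mb: int, min_mb: int) -> int:
    """Slots one action occupies: NEEDED_MEMORY / MIN_MEMORY."""
    return needed_mb // min_mb


memory_types = [128, 256, 512]  # MB, the example above
min_mb = min(memory_types)
print(slot_count(4096, min_mb))                   # 32
print([weight(m, min_mb) for m in memory_types])  # [1, 2, 4]
```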

A few questions: How do the loadbalancers keep track of the memory
footprint? How are slots shared among loadbalancers? And are you also
considering any changes to the throttling mechanism? Since one action can
now occupy multiple slots, throttling might need to change as well.

I also expect that, with this change, the operator will be able to
configure the types and the number of prewarm containers as well.

Thanks
Regards
Dominic





2018-05-10 6:50 GMT+09:00 Markus Thoemmes <[email protected]>:

> Heya, lengthy text ahead, bear with me :)
>
> # Proposal: Memory Aware Scheduling
>
> ## Problem
>
> Today, an invoker has a fixed number of slots, which is determined by the
> number of CPUs * the number of shares per CPU. This does not reflect how a
> serverless platform actually "sells" its resources: The user defines the
> amount of **memory** she wants to use.
>
> Instead of relying on CPU shares, I propose to schedule based on memory
> consumption.
>
> An example:
>
> Let's assume an invoker has 16 slots throughout the proposal. Today,
> OpenWhisk would schedule 16 * 512MB containers to one machine, even though
> it might not have sufficient memory to actually host those containers. The
> operator could adjust the core share to fit the worst case of
> MAX_MEMORY containers, but that's not a good option either, since it
> leaves memory unused when most users consume less.
>
> With this proposal, OpenWhisk will schedule based on the available memory
> on the invoker machine; let's assume 4GB throughout the proposal. If all
> users use 512MB containers, there will be at most 8 (4GB / 512MB) of them
> on one machine. If, on the other hand, all users use 128MB containers,
> there will be 32 on one machine.
>
> ## Benefits
>
> * Scheduling is tied to the resource the user can choose
> * Allows a broader range of memory options (MAX_MEMORY could be equal to
> the memory available on the biggest machine in theory) without risking
> over-/undercommits
>
> ## Risks
>
> * Fragmentation. It certainly becomes more of an issue the more options we
> provide. I'm wondering how much worse it will get compared to today,
> though, given that we already "waste" 50% of the machine if users choose
> 128MB containers (16 slots * 128MB = 2GB of 4GB, with the settings
> described above). Fragmentation is "configurable" by making the list of
> choices available to the user longer or shorter.
>
> ## Implementation
>
> The following changes are needed to implement memory based scheduling
> throughout the system.
>
> ### 1. Loadbalancer
>
> The loadbalancer keeps track of busy/idle slots on each invoker today. In
> our example, there's a Semaphore with 16 slots which are given away during
> scheduling. The notion of slots can be kept, but their count will be
> calculated and given away differently. Assuming our 4GB invokers (for
> simplicity I propose homogeneous invokers as a first step), each invoker
> will get `AVAILABLE_MEMORY / MIN_MEMORY` slots, since that is the maximum
> number of containers that could run on it in parallel. When an action is
> scheduled, it will need to acquire `NEEDED_MEMORY / MIN_MEMORY` slots on an
> invoker. In the following, the number of slots assigned to a request is
> called its "weight".
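
A minimal single-threaded sketch of this weighted bookkeeping for one
invoker, assuming memory in MB and using a plain counter in place of the
loadbalancer's Semaphore; `InvokerSlots` and its methods are hypothetical
names:

```python
class InvokerSlots:
    """Tracks free slots of one invoker; one slot is MIN_MEMORY worth of memory."""

    def __init__(self, available_mb: int, min_mb: int):
        self.min_mb = min_mb
        self.capacity = available_mb // min_mb  # AVAILABLE_MEMORY / MIN_MEMORY
        self.free = self.capacity

    def try_acquire(self, needed_mb: int) -> bool:
        """Take the request's weight (NEEDED_MEMORY / MIN_MEMORY) only if all slots are free."""
        weight = needed_mb // self.min_mb
        if weight <= self.free:
            self.free -= weight
            return True
        return False

    def release(self, needed_mb: int) -> None:
        """Give a finished request's slots back; never exceeds the capacity."""
        self.free = min(self.capacity, self.free + needed_mb // self.min_mb)

    def free_memory_mb(self) -> int:
        """Free memory on this invoker, e.g. for a per-invoker metric."""
        return self.free * self.min_mb


invoker = InvokerSlots(available_mb=4096, min_mb=128)
placed = sum(invoker.try_acquire(512) for _ in range(10))
print(placed, invoker.free)  # 8 0
```

Eight 512MB requests fill the 32 slots; the remaining two are rejected, just
as with today's Semaphore once it runs out of permits.
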
>
> The loadbalancer should furthermore emit a metric of the system's maximum
> capacity (`NUM_INVOKERS * AVAILABLE_MEMORY`) so the operator can tell how
> much capacity is left. It would also be useful to have this metric per
> invoker to be able to tell how much fragmentation there is in the system.
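
A minimal sketch of these metrics, assuming the loadbalancer knows the free
memory of each invoker in MB; the function and metric names are
hypothetical. Comparing the total free memory with the largest free amount
on a single invoker is one way to expose fragmentation:

```python
def capacity_metrics(free_mb_per_invoker: list, available_mb: int) -> dict:
    total = len(free_mb_per_invoker) * available_mb  # NUM_INVOKERS * AVAILABLE_MEMORY
    free = sum(free_mb_per_invoker)                  # memory left on the whole system
    largest = max(free_mb_per_invoker, default=0)    # biggest action that still fits anywhere
    return {"total_mb": total, "free_mb": free, "largest_free_mb": largest}


# Two 4GB invokers with 256MB free each: 512MB free in total,
# yet no single invoker can host a 512MB action.
m = capacity_metrics([256, 256], available_mb=4096)
print(m)  # {'total_mb': 8192, 'free_mb': 512, 'largest_free_mb': 256}
```
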
>
> ### 2. Invoker
>
> In the invoker, the `MessageFeed` logic will need adjustment. Like the
> loadbalancer, it could reduce its capacity based on the weight of each
> request rather than by the fixed value 1. Capacity starts at the maximum
> number of parallel containers (`AVAILABLE_MEMORY / MIN_MEMORY`) and is
> reduced by the number of slots taken by each request.
>
> This cannot be implemented exactly, though, since the consumer always
> pulls N records (and records themselves don't carry a weight). For
> example, an invocation which needs 512MB gets 4 slots in our case. Upon
> releasing them, the feed needs to pull 4 new messages, since it doesn't
> know the weight of each message beforehand. If the first of those messages
> has a weight of 4, though, the other 3 messages will stay in the buffer
> and need to wait for more capacity to be freed. In case of a crash, the
> messages in the buffer will be lost.
>
> The data loss is only relevant in an overload + crash scenario, since in
> any other case there should not be more messages on the invoker topic than
> it can handle.
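
A toy model of this behavior, assuming weights are already expressed in
slots, messages are started strictly in order, and the feed refills so that
at most `free` messages are buffered (the real `MessageFeed` refill logic may
differ); `WeightedFeed` is a hypothetical name:

```python
from collections import deque


class WeightedFeed:
    """Toy invoker feed whose capacity is counted in slots.

    Records carry no weight on the topic, so the feed only learns a
    message's weight after pulling it.
    """

    def __init__(self, capacity_slots: int, topic: deque):
        self.free = capacity_slots
        self.topic = topic      # stands in for the invoker's topic
        self.buffer = deque()   # pulled but not started yet: lost on a crash
        self.running = []

    def fill(self) -> None:
        # At most `free` messages could start (if all had weight 1),
        # so pull enough records to have that many buffered.
        wanted = max(0, self.free - len(self.buffer))
        for _ in range(min(wanted, len(self.topic))):
            self.buffer.append(self.topic.popleft())
        # Start buffered messages in order while the head one fits.
        while self.buffer and self.buffer[0] <= self.free:
            weight = self.buffer.popleft()
            self.free -= weight
            self.running.append(weight)

    def complete(self, weight: int) -> None:
        # A finished request gives its slots back, then the feed refills.
        self.running.remove(weight)
        self.free += weight
        self.fill()


topic = deque([4] * 8)   # eight 512MB requests (weight 4 with 128MB slots)
feed = WeightedFeed(capacity_slots=32, topic=topic)
feed.fill()
print(feed.free, len(feed.buffer))   # 0 0

topic.extend([4, 1, 1, 1])           # next: one 512MB and three 128MB requests
feed.complete(4)
print(feed.free, list(feed.buffer))  # 0 [1, 1, 1]
```

This reproduces the case described above: after a 512MB request completes,
the feed pulls 4 records, the first one (weight 4) takes all freed slots, and
the three 128MB requests stay in the buffer.
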
>
> ### 3. Configuration
>
> The operator defines a list of possible memory values, and the system
> determines at startup whether that list is usable for scheduling (i.e.
> all values are divisible by the smallest value).
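
A minimal sketch of that startup check, assuming values in MB;
`validate_memory_options` is a hypothetical name, and returning the
per-value weights is just one convenient choice:

```python
def validate_memory_options(options_mb):
    """Return (min_mb, weights) or raise if the list is unusable for scheduling."""
    if not options_mb or any(m <= 0 for m in options_mb):
        raise ValueError("memory options must be a non-empty list of positive values")
    min_mb = min(options_mb)
    # Every value must be a whole number of MIN_MEMORY slots.
    bad = [m for m in options_mb if m % min_mb != 0]
    if bad:
        raise ValueError(f"not divisible by the smallest value {min_mb}MB: {bad}")
    return min_mb, {m: m // min_mb for m in sorted(options_mb)}


print(validate_memory_options([128, 256, 512]))  # (128, {128: 1, 256: 2, 512: 4})
try:
    validate_memory_options([128, 192, 512])
except ValueError as err:
    print(err)  # not divisible by the smallest value 128MB: [192]
```
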
>
> To configure the available memory per invoker, I propose a fixed value
> (4GB in our example) as a first step, passed to both the controller and
> the invoker. Future work could make this dynamic via the ping mechanism.
>
>
> Fingers bleeding, I'll stop here :)
>
> Looking forward to your input!
>
> Cheers,
> Markus
>
>
