Hi Mesos users and developers,

In Mesos 0.20.0 we’d like to release a feature called framework rate limiting.

What is Framework Rate Limiting

In a multi-framework environment, this feature aims to protect the throughput 
of high-SLA (e.g., production, service) frameworks by having the master 
throttle messages from other (e.g., development, batch) frameworks.

To throttle messages from a framework, the operator sets a qps (queries per 
seconds) value for each framework identified by its principal (You can also 
throttle a group of frameworks together but we’ll assume individual frameworks 
in this email; see the config definition here; we’ll publish a user doc soon). 
The master then promises not to process messages from that framework at a rate 
above qps (but possibly below it, especially if the qps is set too high that 
master cannot keep up). The outstanding messages are stored in memory on the 
master.

We’ll publish more detailed guidance on how to set the qps (as well as the 
capacity which is described below) in the user doc.

Limitation and Future Work

To prevent the number of outstanding messages from growing unboundedly and 
crashing the master due to out of memory, the operator can set a capacity for 
the framework which is the max number of outstanding messages allowed. When a 
framework exceeds the capacity, a FrameworkErrorMessage is sent back to the 
framework which will abort the scheduler driver and invoke the error() 
callback. It doesn’t kill any tasks or the scheduler itself and the framework 
developer can choose to restart or failover the scheduler instance to remedy 
the consequences of dropped messages (unless your framework doesn’t assume all 
messages sent to the master are processed).

The above is the behavior we are planning for 0.20.0 with respect to framework 
rate limiting. We are going to iterate on this feature by having the master 
send an early alert when the message queue for this framework starts to build 
up (MESOS–1664). The scheduler can react by throttling itself (to avoid the 
error message) or ignoring this alert if it’s just a temporary burst. 
MESOS–1664 is likely not going to land in 0.20.0.

Due to this limitation we don’t recommend using the rate limiting feature to 
throttle production frameworks for now unless you are sure about the 
consequences of the error message. Of course it’s OK to use it to protect 
production frameworks by throttling other frameworks and it doesn’t have any 
effect on the master if it’s not explicitly enabled.

We are excited about this feature and hope to release it in 0.20.0. Your 
comments about the feature and this limitation are welcome and appreciated! If 
there is no objection within 3 days we’ll assume lazy consensus and put it in 
the release so you can start using it.

Thanks!
Yan

Reply via email to