Re: Review Request 35702: Added /reserve HTTP endpoint to the master.

Michael Park Tue, 28 Jul 2015 14:04:12 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35702/
-----------------------------------------------------------

(Updated July 28, 2015, 9:03 p.m.)

Review request for mesos, Adam B, Benjamin Hindman, Ben Mahler, Jie Yu, Joris
Van Remoortere, and Vinod Kone.

Changes
-------

Only rescind an offer if rescinding it will contribute to satisfying the
request.

Bugs: MESOS-2600
https://issues.apache.org/jira/browse/MESOS-2600

Repository: mesos

Description
-------

This involved a lot more challenges than I anticipated. I've captured the
various approaches, their limitations, and their deal-breakers here:
[Master Endpoint Implementation
Challenges](https://docs.google.com/document/d/1cwVz4aKiCYP9Y4MOwHYZkyaiuEv7fArCye-vPvB2lAI/edit#)

Key points:

* This is a stop-gap solution until we shift the offer creation/management
logic from the master to the allocator.
* `updateAvailable` and `updateSlave` are kept separate because
(1) `updateAvailable` is allowed to fail whereas `updateSlave` must not.
(2) `updateAvailable` returns a `Future` whereas `updateSlave` does not.
(3) `updateAvailable` never leaves the allocator in an over-allocated state
and must not, whereas `updateSlave` does, and can.
* The algorithm:
* Initially, the master pessimistically assumes that what seem like
"available" resources will be gone.
This is due to the race between the allocator scheduling an `allocate`
call to itself vs master's `allocator->updateAvailable` invocation.
As such, we first try to satisfy the request only with the offered
resources.
* We greedily rescind one offer at a time until we've rescinded
sufficiently many offers.
IMPORTANT: We perform `recoverResources(..., Filters())` rather than
`recoverResources(..., None())` so that we can pretty much always win the race
against `allocate`.
In the case that we lose, no disaster occurs. We simply fail
to satisfy the request.
* If we still don't have enough resources after rescinding all offers, be
optimistic and forward the request to the allocator since there may be
available resources to satisfy the request.
* If the allocator returns a failure, report the error to the user with
`PreconditionFailed`. This could instead be `Forbidden` or `Conflict`;
we'll pick one eventually.
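
The steps above can be sketched roughly as follows. This is a simplified,
self-contained model, not the code in the diff: `Resources` is reduced to a
map of scalar amounts, and the `Offer`, `Allocator`, `Status`, `contribute`
and `reserve` definitions are hypothetical stand-ins. The real
`recoverResources` and `updateAvailable` are asynchronous and take more
arguments; here they are plain synchronous calls, so the race against
`allocate` is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for Mesos `Resources`: resource name -> scalar amount.
using Resources = std::map<std::string, double>;

// Hypothetical outstanding offer on a single slave.
struct Offer {
  int id;
  Resources resources;
};

// Hypothetical, synchronous model of the allocator's view of one slave.
struct Allocator {
  Resources available;

  // Stand-in for `recoverResources(..., Filters())`: offered resources
  // go back into the available pool.
  void recoverResources(const Resources& recovered) {
    for (const auto& entry : recovered) {
      assert(entry.second >= 0);
      available[entry.first] += entry.second;
    }
  }

  // Stand-in for `updateAvailable`: allowed to fail, and never leaves the
  // pool over-allocated (nothing is changed on failure).
  bool updateAvailable(const Resources& requested) {
    for (const auto& entry : requested) {
      auto it = available.find(entry.first);
      if (it == available.end() || it->second < entry.second) return false;
    }
    for (const auto& entry : requested) available[entry.first] -= entry.second;
    return true;
  }
};

enum class Status { OK, PreconditionFailed };

// Subtracts what `offered` can supply from `remaining`. Returns true only
// if the offer contributed towards the outstanding request.
bool contribute(Resources& remaining, const Resources& offered) {
  bool contributed = false;
  for (const auto& entry : offered) {
    auto it = remaining.find(entry.first);
    if (it == remaining.end() || entry.second <= 0) continue;
    it->second -= std::min(it->second, entry.second);
    if (it->second <= 0) remaining.erase(it);
    contributed = true;
  }
  return contributed;
}

Status reserve(Allocator& allocator,
               std::vector<Offer>& offers,
               const Resources& requested,
               std::vector<int>& rescinded) {
  Resources remaining;
  for (const auto& entry : requested) {
    if (entry.second > 0) remaining.insert(entry);
  }

  // Pessimistic phase: greedily rescind one offer at a time, skipping
  // offers that would not help, until the request is covered.
  for (auto it = offers.begin(); it != offers.end() && !remaining.empty();) {
    if (contribute(remaining, it->resources)) {
      allocator.recoverResources(it->resources);
      rescinded.push_back(it->id);
      it = offers.erase(it);
    } else {
      ++it;
    }
  }

  // Forward to the allocator. If `remaining` is non-empty this is the
  // optimistic case: available resources may still cover the rest.
  return allocator.updateAvailable(requested) ? Status::OK
                                              : Status::PreconditionFailed;
}
```

In this model, a request that fails still leaves the rescinded offers
rescinded, which matches the "no disaster occurs" case above: the recovered
resources simply remain available for the next allocation.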

This approach is clearly not ideal, since we would prefer to rescind as few
offers as possible.
The challenges of implementing the ideal solution in the current state are
described in the document above.

TODO(mpark): Add more comments and test cases.

Diffs (updated)
-----

src/master/http.cpp 3a1598fad4db03e5f62fd4a6bd26b2bedeee4070
src/master/master.hpp 827d0d599912b2936beb9615610f627f6c9a2d43
src/master/master.cpp 5b5e3c37d4433c8524db267866aebc0a35a181f1
src/master/validation.hpp 469d6f56c3de28a34177124aae81ce24cb4ad160
src/master/validation.cpp 9d128aa1b349b018b8e4a1916434d848761ca051

Diff: https://reviews.apache.org/r/35702/diff/

Testing
-------

`make check`

Thanks,

Michael Park
