On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <[email protected]> wrote:
> I'd appreciate feedback on a proposal for a simulation tool for debugging
> and testing the Mesos master and allocator.

Overall, this is awesome! I'd love to see Mesos improve in this area,
and I'd be happy to help out where I can.

> Simulations would--randomly but deterministically--explore the state space
> of cloud configurations and check for invariant violations and collect
> stats--in addition to those already in the Mesos master code.

It would be useful to be able to (a) record a "trace" from a running
(production) Mesos instance and (b) replay that trace under the
simulator, e.g., to explore the impact of changes to Mesos. For a
similar approach, see Section 3.1 of the Borg paper [1].
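
To make the record/replay idea concrete, here is a minimal sketch in
C++. Everything in it is an assumption made for illustration: the
TraceEvent struct, the one-event-per-line text format, the event names
in the comments, and the serializeTrace/parseTrace/replayTrace helpers
are hypothetical and are not part of Mesos or libprocess. It only shows
the shape of the idea: events are written with timestamps, read back,
and fed to a handler while a simulated clock jumps from event to event.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical trace record. Fields must not contain whitespace,
// because the toy format below is whitespace-separated.
struct TraceEvent {
  std::uint64_t timestampMs;  // Time since the start of the trace.
  std::string kind;           // e.g. "agent_added" (made-up name).
  std::string subject;        // e.g. an agent or task identifier.
};

// "Record": write one event per line as "<timestamp> <kind> <subject>".
std::string serializeTrace(const std::vector<TraceEvent>& events) {
  std::ostringstream out;
  for (const TraceEvent& event : events) {
    out << event.timestampMs << ' ' << event.kind << ' '
        << event.subject << '\n';
  }
  return out.str();
}

// Read the events back; parsing stops at the first malformed line.
std::vector<TraceEvent> parseTrace(const std::string& text) {
  std::istringstream in(text);
  std::vector<TraceEvent> events;
  TraceEvent event;
  while (in >> event.timestampMs >> event.kind >> event.subject) {
    events.push_back(event);
  }
  return events;
}

// "Replay": hand each event to `handler` together with the simulated
// time, which never moves backwards. Returns the final simulated time.
std::uint64_t replayTrace(
    const std::vector<TraceEvent>& events,
    const std::function<void(std::uint64_t, const TraceEvent&)>& handler) {
  std::uint64_t now = 0;
  for (const TraceEvent& event : events) {
    if (event.timestampMs > now) {
      now = event.timestampMs;  // Jump the clock; no real waiting.
    }
    handler(now, event);
  }
  return now;
}
```

In a real simulator the handler would drive the master/allocator under
test, so the same recorded workload could be replayed against two
versions of the code and the resulting stats compared.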

> * Automated transformation of Mesos source code for integration into the
> simulator, to allow the simulator to use simulated time instead of real
> time and to intercept libprocess-based inter-thread and inter-node
> communication.

Can you elaborate on how you see the source code transformation working?

Because of the way in which Mesos uses processes and message passing,
you can already control timeouts and inter-process communication in a
fairly sophisticated way -- for example, see Clock::advance(),
Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
it would be possible to implement the simulator in a way that
leverages (and improves!) the existing facilities in libprocess,
rather than building new functionality? For example, to control the
way in which processes and events are interleaved, would it be
possible to do this by hooking into the libprocess message dispatch
logic, rather than doing a source code transformation?
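
As a rough illustration of the dispatch-hook idea, the toy C++ sketch
below keeps pending messages in a queue and lets a seeded PRNG choose
which one is delivered next, so a given seed always reproduces the same
interleaving. This is not libprocess code: SimulatedDispatcher, Message,
deliveryOrder and the message names are hypothetical, and a real hook
would also have to respect any ordering guarantees libprocess gives
between a pair of processes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a message waiting to be dispatched.
struct Message {
  std::string to;    // Destination process (made-up names below).
  std::string name;  // Message type.
};

// Toy dispatcher: the delivery order is "random" but fully determined
// by the seed, so a failing interleaving can be replayed exactly.
class SimulatedDispatcher {
 public:
  explicit SimulatedDispatcher(std::uint32_t seed) : rng_(seed) {}

  void enqueue(Message message) { pending_.push_back(std::move(message)); }

  bool empty() const { return pending_.empty(); }

  // Removes and returns one pending message chosen by the PRNG.
  // std::mt19937 output is fixed by the C++ standard, and plain modulo
  // avoids implementation-defined distribution classes.
  Message next() {
    assert(!pending_.empty());
    std::size_t index = rng_() % pending_.size();
    Message chosen = std::move(pending_[index]);
    pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(index));
    return chosen;
  }

 private:
  std::mt19937 rng_;
  std::vector<Message> pending_;
};

// Enqueues the same three messages and returns the order in which the
// dispatcher delivers them for the given seed.
std::vector<std::string> deliveryOrder(std::uint32_t seed) {
  SimulatedDispatcher dispatcher(seed);
  dispatcher.enqueue({"master", "register"});
  dispatcher.enqueue({"allocator", "offer"});
  dispatcher.enqueue({"master", "status_update"});

  std::vector<std::string> order;
  while (!dispatcher.empty()) {
    order.push_back(dispatcher.next().name);
  }
  return order;
}
```

Calling deliveryOrder() twice with the same seed yields the same
sequence, while sweeping over seeds explores different interleavings;
logging the seed alongside any invariant violation is then enough to
reproduce it.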

> Examples of problems to be detected:
> * Liveness problems such as deadlock, livelock, starvation
> * Safety problems such as oversubscription of resources, permanent loss of
> resources or tasks, data corruption in general.
> * Fairness problems such as sustained imbalance in allocation of resources
> to frameworks.
> * Performance problems such as high response time, low resource utilization.

Validating that the system behaves correctly in the presence of
network partitions would also be great.

To clarify, it seems like you are primarily focused on finding
bugs/problems in core Mesos, rather than in Mesos framework
implementations. The latter would also be a very interesting project
(e.g., we'd give framework authors a tool that would push their
scheduler/executor implementation through the entire state space of
situations the framework would need to handle).

Neil

[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf