I'd appreciate feedback on a proposal for a simulation tool for debugging
and testing the Mesos master and allocator.

Simulations would explore the state space of cloud configurations in a
pseudo-random but deterministic way, checking for invariant violations and
collecting statistics in addition to those already in the Mesos master code.

The key difference from test case-driven testing is that simulations would
be driven by configurations alone, with no test cases specified. The
simulator is intended to complement the existing test case-driven testing
and to aid debugging by generating repeatable traces.

The main proposed features:
* Simulation results are deterministic. All runs with the same parameters
will generate identical results regardless of the host system.
* Automated transformation of Mesos source code for integration into the
simulator, to allow the simulator to use simulated time instead of real
time and to intercept libprocess-based inter-thread and inter-node
communication.
* Flexible tracing capabilities for debugging.
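
As an illustration of the simulated-time idea, here is a minimal sketch in
Python (not Mesos code). The SimClock class and its call_after/run methods
are hypothetical names invented for this example. Time advances only when
the simulator pops the next timer, so no real sleeping takes place.

```python
import heapq
import itertools


class SimClock:
    """Hypothetical virtual clock: time moves only when the simulator says so."""

    def __init__(self):
        self.now = 0.0
        self._timers = []               # heap of (deadline, seq, callback)
        self._seq = itertools.count()   # tie-breaker for equal deadlines

    def call_after(self, delay, callback):
        # Schedule callback to fire `delay` simulated seconds from now.
        heapq.heappush(self._timers,
                       (self.now + delay, next(self._seq), callback))

    def run(self):
        # Jump straight to each deadline instead of waiting for it.
        while self._timers:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline
            callback()


clock = SimClock()
fired = []
clock.call_after(5.0, lambda: fired.append(("allocation", clock.now)))
clock.call_after(1.0, lambda: fired.append(("offer_timeout", clock.now)))
clock.run()
print(fired)
```

The sequence counter breaks ties between timers with equal deadlines, so
the firing order does not depend on how callbacks happen to compare.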

Other proposed features:
* Speed-up and/or scale-down: The cloud abstraction should allow
simulations to take substantially less time and fewer resources than the
corresponding real cloud configurations.
* Framework interface for receiving and responding to offers that allows
plugging in heterogeneous framework models, possibly interfacing with other
simulation tools such as mesosaurus.

Examples of problems to be detected:
* Liveness problems such as deadlock, livelock, and starvation.
* Safety problems such as oversubscription of resources, permanent loss of
resources or tasks, and data corruption in general.
* Fairness problems such as sustained imbalance in allocation of resources
to frameworks.
* Performance problems such as high response time or low resource
utilization.
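
To make the safety category concrete, here is a minimal sketch of an
oversubscription check in Python. The function name, the data layout
(per-node CPU capacity plus a list of allocations), and the sample values
are all assumptions made for illustration, not Mesos data structures.

```python
def oversubscribed_nodes(capacity, allocations):
    """Return the nodes whose allocated CPUs exceed their capacity.

    capacity:    dict mapping node name -> total CPUs (hypothetical layout)
    allocations: list of (node, framework, cpus) tuples
    """
    used = {}
    for node, _framework, cpus in allocations:
        used[node] = used.get(node, 0) + cpus
    return sorted(n for n, u in used.items() if u > capacity.get(n, 0))


capacity = {"node1": 8, "node2": 4}
allocations = [
    ("node1", "frameworkA", 4),
    ("node1", "frameworkB", 4),   # node1 is exactly full: allowed
    ("node2", "frameworkA", 3),
    ("node2", "frameworkB", 2),   # node2 now holds 5 of 4 CPUs: violation
]
violations = oversubscribed_nodes(capacity, allocations)
print(violations)
```

A simulator could run a check of this kind after every simulated event and
stop at the first violation, leaving the trace up to that point for
debugging.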

Rough sketch of how the proposed simulator would work:
* Automatic transformation of Mesos source code to replace or intercept
classes and functions related to real time (e.g., Duration and Timeout) and
inter-thread and inter-node communication (libprocess) to give the
simulator control of timing and event interleaving.
* The simulator would add invariant checks (e.g., slave oversubscription,
deadlock, framework starvation) and statistics (e.g., resource
utilization, framework resource usage) to those already in the Mesos master
code.
* A simulation run would use a pseudo-random sequence to control the
interleaving of events (i.e., inter-thread and inter-node communication),
randomly exploring the simulated cloud state space, including the Mesos
master state.
* In general, the idea is to randomly explore billions of states (or
trillions on multiple machines in parallel) without having to identify
specific test cases beforehand; only the configurations to explore need to
be chosen.
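
The pseudo-random interleaving could look roughly like the following Python
sketch. It is a toy model, not the proposed implementation: the
run_simulation function, the actor names, and the message names are
hypothetical. Each actor has an ordered queue of pending messages, and a
seeded generator picks which actor delivers next.

```python
import random


def run_simulation(seed, pending):
    """Deliver all pending messages in a seed-determined interleaving."""
    rng = random.Random(seed)   # private generator, no shared global state
    # Sort the actors so the result does not depend on dict insertion order.
    queues = {src: list(msgs) for src, msgs in sorted(pending.items())}
    trace = []
    while any(queues.values()):
        ready = [src for src, q in queues.items() if q]
        src = rng.choice(ready)                   # pick the next actor to act
        trace.append((src, queues[src].pop(0)))   # per-actor order is kept
    return trace


pending = {
    "framework1": ["register", "accept_offer"],
    "node1": ["register", "status_update"],
    "master": ["allocate"],
}
trace_a = run_simulation(42, pending)
trace_b = run_simulation(42, pending)
print(trace_a == trace_b)
```

Running again with the same seed replays the same trace, which is what
makes a failing run debuggable; a different seed explores a different
interleaving. This sketch only shows repeatability for a given Python
version. The stronger guarantee of identical results regardless of the
host system would depend on the generator chosen for the real simulator.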

I'd appreciate any comments, suggestions, or info about related tools.

-Maged

Maged Michael (IBM)
