I'd appreciate feedback on a proposal for a simulation tool for debugging and testing the Mesos master and allocator.
Simulations would explore the state space of cloud configurations randomly but deterministically, checking for invariant violations and collecting statistics in addition to those already in the Mesos master code. The key difference from test case-driven testing is that the simulations would be driven only by configurations, without specifying test cases. The simulator is intended to complement the existing test case-driven testing and to help debugging by generating repeatable traces.

The main proposed features:
* Simulation results are deterministic. All runs with the same parameters will generate identical results regardless of the host system.
* Automated transformation of Mesos source code for integration into the simulator, to allow the simulator to use simulated time instead of real time and to intercept libprocess-based inter-thread and inter-node communication.
* Flexible tracing capabilities for debugging.

Other proposed features:
* Speed-up and/or scale-down: The cloud abstraction should allow simulations to take substantially less time and fewer resources than the corresponding real cloud configurations.
* Framework interface for receiving and responding to offers that allows plugging in heterogeneous framework models, possibly interfacing with other simulation tools such as mesosaurus.

Examples of problems to be detected:
* Liveness problems such as deadlock, livelock, and starvation.
* Safety problems such as oversubscription of resources, permanent loss of resources or tasks, and data corruption in general.
* Fairness problems such as sustained imbalance in allocation of resources to frameworks.
* Performance problems such as high response time and low resource utilization.
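
To make the determinism and invariant-checking ideas above more concrete, here is a minimal Python sketch. It is not Mesos code and does not use any Mesos or libprocess API: `run_simulation`, `InvariantViolation`, the toy offer model, and all parameters are hypothetical names invented for illustration. It only shows the general pattern of drawing every random choice from one seeded generator, advancing a simulated clock instead of reading real time, and checking a safety invariant after every event.

```python
import heapq
import random


class InvariantViolation(Exception):
    """Raised when a simulated state breaks a checked invariant (hypothetical)."""


def run_simulation(seed, num_agents=3, num_frameworks=2,
                   cpus_per_agent=4, steps=200):
    """Run one toy simulation and return its trace as a list of tuples."""
    rng = random.Random(seed)   # the only source of randomness in the run
    now = 0                     # simulated time; the wall clock is never read
    free = [cpus_per_agent] * num_agents
    running = []                # min-heap of (finish_time, seq, agent, cpus)
    seq = 0                     # tie-breaker so heap ordering is well defined
    trace = []

    for _ in range(steps):
        now += rng.randint(1, 5)

        # Complete every task whose simulated finish time has passed.
        while running and running[0][0] <= now:
            _, _, agent, cpus = heapq.heappop(running)
            free[agent] += cpus
            trace.append((now, "finish", agent, cpus))

        # A randomly chosen framework asks a randomly chosen agent for CPUs.
        agent = rng.randrange(num_agents)
        framework = rng.randrange(num_frameworks)
        want = rng.randint(1, cpus_per_agent)
        if want <= free[agent]:
            free[agent] -= want
            seq += 1
            finish = now + rng.randint(1, 20)
            heapq.heappush(running, (finish, seq, agent, want))
            trace.append((now, "launch", framework, agent, want))
        else:
            trace.append((now, "decline", framework, agent, want))

        # Safety invariant: no oversubscription and no lost resources.
        for a in range(num_agents):
            used = sum(cpus for _, _, ag, cpus in running if ag == a)
            if free[a] < 0 or free[a] + used != cpus_per_agent:
                raise InvariantViolation(
                    f"t={now} agent={a} free={free[a]} used={used}")

    return trace


if __name__ == "__main__":
    first = run_simulation(seed=42)
    second = run_simulation(seed=42)
    print("identical traces:", first == second)
    print("events recorded:", len(first))
```

Because the trace depends only on the seed and the configuration parameters, a run that raises `InvariantViolation` can be replayed exactly by reusing the same seed. Python's `random.Random` is used here only for convenience; a real tool would presumably want a generator whose output is specified independently of the host system and library version.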

Rough sketch of how the proposed simulator would work:
* Automatic transformation of Mesos source code to replace or intercept classes and functions related to real time (e.g., Duration and Timeout) and inter-thread and inter-node communication (libprocess) to give the simulator control of timing and event interleaving.
* The simulator would add invariant checks (e.g., slave oversubscription, deadlock, framework starvation) and statistics (e.g., resource utilization, framework resource usage) to those already in the Mesos master code.
* A simulation run would use a pseudo-random sequence to control the interleaving of events (i.e., inter-thread and inter-node communication) to randomly explore the simulated cloud state space, including the Mesos master state.
* In general, the idea is to randomly explore billions of states (or trillions on multiple machines in parallel) without having to identify specific test cases beforehand, other than which configurations to explore.

I'd appreciate any comments, suggestions, or info about related tools.

-Maged

Maged Michael (IBM)