+1. Likewise, I think it's awesome, and I would love to be involved.

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Mon, Oct 5, 2015 at 10:50 AM, Neil Conway <[email protected]> wrote:

> On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <[email protected]> wrote:
> > I'd appreciate feedback on a proposal for a simulation tool for debugging
> > and testing the Mesos master and allocator.
>
> Overall, this is awesome! I'd love to see Mesos improve in this area,
> and I'd be happy to help out where I can.
>
> > Simulations would--randomly but deterministically--explore the state space
> > of cloud configurations and check for invariant violations and collect
> > stats--in addition to those already in the Mesos master code.
>
> It would be useful to be able to (a) record a "trace" from a running
> (production) Mesos instance (b) replay that trace under the simulator,
> e.g., to explore the impact of changes to Mesos. For example, see
> Section 3.1 of the Borg paper [1].
>
> > * Automated transformation of Mesos source code for integration into the
> >   simulator, to allow the simulator to use simulated time instead of real
> >   time and to intercept libprocess-based inter-thread and inter-node
> >   communication.
>
> Can you elaborate on how you see the source code transformation working?
>
> Because of the way in which Mesos uses processes and message passing,
> you can already control timeouts and inter-process communication in a
> fairly sophisticated way -- for example, see Clock::advance(),
> Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
> it would be possible to implement the simulator in a way that
> leverages (and improves!) the existing facilities in libprocess,
> rather than building new functionality? For example, to control the
> way in which processes and events are interleaved, would it be
> possible to do this by hooking into the libprocess message dispatch
> logic, rather than doing a source code transformation?
>
> > Examples of problems to be detected:
> > * Liveness problems such as deadlock, livelock, starvation
> > * Safety problems such as oversubscription of resources, permanent loss of
> >   resources or tasks, data corruption in general.
> > * Fairness problems such as sustained imbalance in allocation of resources
> >   to frameworks.
> > * Performance problems such as high response time, low resource utilization.
>
> Validating that the system behaves correctly in the presence of
> network partitions would also be great.
>
> To clarify, it seems like you are primarily focused on finding
> bugs/problems in core Mesos, rather than in Mesos framework
> implementations. The latter would also be a very interesting project
> (e.g., as a framework author, we'd give you a tool that would push
> your scheduler/executor implementation through the entire state space
> of situations the framework would need to handle).
>
> Neil
>
> [1]
> https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
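
To make the "randomly but deterministically" idea from the thread above a bit more concrete, here is a minimal, self-contained sketch in C++. It is not Mesos or libprocess code: `SimClock`, `SimResult`, and `runSimulation` are hypothetical names invented for illustration, and the "allocator" is a toy counter of CPUs. The sketch assumes only two things from the discussion: time is simulated rather than real, and all randomness comes from a seed so that a failing run can be replayed exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Hypothetical simulated clock (NOT libprocess's Clock): time only moves
// when the simulator explicitly advances it.
class SimClock {
public:
  using Callback = std::function<void()>;

  std::uint64_t now() const { return now_; }

  // Schedule `cb` to fire `delay` ticks after the current simulated time.
  void schedule(std::uint64_t delay, Callback cb) {
    timers_.push(Timer{now_ + delay, seq_++, std::move(cb)});
  }

  // Advance simulated time by `ticks`, firing due timers in deadline order.
  void advance(std::uint64_t ticks) {
    const std::uint64_t target = now_ + ticks;
    while (!timers_.empty() && timers_.top().deadline <= target) {
      Timer t = timers_.top();  // copy first: the callback may schedule more
      timers_.pop();
      assert(t.deadline >= now_);  // simulated time never runs backwards
      now_ = t.deadline;
      t.cb();
    }
    now_ = target;
  }

private:
  struct Timer {
    std::uint64_t deadline;
    std::uint64_t seq;  // insertion order breaks ties deterministically
    Callback cb;
    bool operator>(const Timer& o) const {
      return deadline != o.deadline ? deadline > o.deadline : seq > o.seq;
    }
  };

  std::priority_queue<Timer, std::vector<Timer>, std::greater<Timer>> timers_;
  std::uint64_t now_ = 0;
  std::uint64_t seq_ = 0;
};

struct SimResult {
  std::vector<int> trace;  // allocated CPUs observed after every event
  int violations = 0;      // times the safety invariant was broken
};

// Toy simulation: random requests and releases against a fixed CPU pool.
// Every random choice is drawn from `seed`, so a run is fully reproducible.
SimResult runSimulation(std::uint32_t seed, int steps, int totalCpus) {
  std::mt19937 rng(seed);
  SimClock clock;
  SimResult result;
  int allocated = 0;

  // Safety invariant from the thread: never oversubscribe resources.
  auto record = [&] {
    if (allocated < 0 || allocated > totalCpus) ++result.violations;
    result.trace.push_back(allocated);
  };

  for (int i = 0; i < steps; ++i) {
    const std::uint64_t delay = rng() % 10 + 1;
    const int request = static_cast<int>(rng() % 4) + 1;
    const std::uint64_t hold = rng() % 20 + 1;

    clock.schedule(delay, [&, request, hold] {
      if (allocated + request <= totalCpus) {
        allocated += request;
        // The "task" finishes after `hold` ticks and frees its CPUs.
        clock.schedule(hold, [&, request] {
          allocated -= request;
          record();
        });
      }
      record();
    });

    clock.advance(rng() % 5);
  }

  clock.advance(1000);  // drain all outstanding timers
  return result;
}
```

Calling `runSimulation(42, 200, 8)` twice should yield identical traces, which is what makes a failing seed replayable. A real simulator along the lines Neil suggests would presumably drive the actual master and allocator through libprocess's existing clock and message-interception facilities instead of a stand-in like this one.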
