On Mon, Oct 5, 2015 at 2:41 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
> +1
>
> Likewise, I think it's awesome, would love to be involved.
Thanks. That would be awesome!

--Maged

> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com <http://codetrips.com>
>
> On Mon, Oct 5, 2015 at 10:50 AM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <maged.mich...@gmail.com>
>> wrote:
>> > I'd appreciate feedback on a proposal for a simulation tool for debugging
>> > and testing the Mesos master and allocator.
>>
>> Overall, this is awesome! I'd love to see Mesos improve in this area,
>> and I'd be happy to help out where I can.
>>
>> > Simulations would--randomly but deterministically--explore the state space
>> > of cloud configurations, check for invariant violations, and collect
>> > stats--in addition to those already in the Mesos master code.
>>
>> It would be useful to be able to (a) record a "trace" from a running
>> (production) Mesos instance and (b) replay that trace under the simulator,
>> e.g., to explore the impact of changes to Mesos. For example, see
>> Section 3.1 of the Borg paper [1].
>>
>> > * Automated transformation of Mesos source code for integration into the
>> > simulator, to allow the simulator to use simulated time instead of real
>> > time and to intercept libprocess-based inter-thread and inter-node
>> > communication.
>>
>> Can you elaborate on how you see the source code transformation working?
>>
>> Because of the way in which Mesos uses processes and message passing,
>> you can already control timeouts and inter-process communication in a
>> fairly sophisticated way -- for example, see Clock::advance(),
>> Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
>> it would be possible to implement the simulator in a way that
>> leverages (and improves!) the existing facilities in libprocess,
>> rather than building new functionality? For example, to control the
>> way in which processes and events are interleaved, would it be possible
>> to hook into the libprocess message dispatch logic, rather than doing a
>> source code transformation?
>>
>> > Examples of problems to be detected:
>> > * Liveness problems such as deadlock, livelock, starvation.
>> > * Safety problems such as oversubscription of resources, permanent loss of
>> > resources or tasks, and data corruption in general.
>> > * Fairness problems such as sustained imbalance in allocation of resources
>> > to frameworks.
>> > * Performance problems such as high response time, low resource utilization.
>>
>> Validating that the system behaves correctly in the presence of
>> network partitions would also be great.
>>
>> To clarify, it seems like you are primarily focused on finding
>> bugs/problems in core Mesos, rather than in Mesos framework
>> implementations. The latter would also be a very interesting project
>> (e.g., as a framework author, you'd get a tool that pushes your
>> scheduler/executor implementation through the entire state space
>> of situations the framework would need to handle).
>>
>> Neil
>>
>> [1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
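
As a rough illustration of the existing libprocess hooks Neil points to above
(Clock::advance(), Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE()), a
Mesos-style test that pauses virtual time and intercepts or drops messages
might look something like the sketch below. The test name and message names
are illustrative placeholders, not necessarily the exact identifiers used in
the Mesos tree; the real facilities live in process/clock.hpp,
process/gmock.hpp, and process/gtest.hpp.

    #include <gtest/gtest.h>

    #include <process/clock.hpp>
    #include <process/future.hpp>
    #include <process/gmock.hpp>
    #include <process/gtest.hpp>
    #include <process/message.hpp>

    #include <stout/duration.hpp>

    using process::Clock;
    using process::Future;
    using process::Message;

    using testing::_;
    using testing::Eq;

    // The test name and message names below are placeholders for
    // illustration only.
    TEST(LibprocessHooksExample, ControlsTimeAndMessages)
    {
      Clock::pause();  // Stop real time; timers now fire only via Clock::advance().

      // Intercept the next message with this name, from any sender to any receiver.
      Future<Message> registration =
        FUTURE_MESSAGE(Eq("mesos.internal.RegisterSlaveMessage"), _, _);

      // Drop a matching message to simulate message loss or a partition.
      Future<Message> dropped =
        DROP_MESSAGE(Eq("mesos.internal.PongSlaveMessage"), _, _);

      // ... start the master and agent under test here ...

      AWAIT_READY(registration);   // Wait (with a test timeout) for the intercept.

      Clock::advance(Seconds(75)); // Jump virtual time past, e.g., a health-check
      Clock::settle();             // timeout, then let all queued events drain.

      // ... assert on the resulting state here ...

      Clock::resume();
    }

A simulator built on these hooks would already get deterministic control over
timers and message delivery without a source transformation, which is the
direction Neil's question suggests exploring.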