On Mon, Oct 5, 2015 at 2:41 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
> +1
>
> Likewise, I think it's awesome, would love to be involved.
Thanks. That would be awesome!

--Maged

> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com <http://codetrips.com>
>
> On Mon, Oct 5, 2015 at 10:50 AM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <maged.mich...@gmail.com>
>> wrote:
>> > I'd appreciate feedback on a proposal for a simulation tool for debugging
>> > and testing the Mesos master and allocator.
>>
>> Overall, this is awesome! I'd love to see Mesos improve in this area,
>> and I'd be happy to help out where I can.
>>
>> > Simulations would--randomly but deterministically--explore the state space
>> > of cloud configurations, check for invariant violations, and collect
>> > stats--in addition to those already in the Mesos master code.
>>
>> It would be useful to be able to (a) record a "trace" from a running
>> (production) Mesos instance and (b) replay that trace under the simulator,
>> e.g., to explore the impact of changes to Mesos. For example, see
>> Section 3.1 of the Borg paper [1].
>>
>> > * Automated transformation of Mesos source code for integration into the
>> > simulator, to allow the simulator to use simulated time instead of real
>> > time and to intercept libprocess-based inter-thread and inter-node
>> > communication.
>>
>> Can you elaborate on how you see the source code transformation working?
>>
>> Because of the way in which Mesos uses processes and message passing,
>> you can already control timeouts and inter-process communication in a
>> fairly sophisticated way -- for example, see Clock::advance(),
>> Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
>> it would be possible to implement the simulator in a way that
>> leverages (and improves!) the existing facilities in libprocess,
>> rather than building new functionality? For example, to control the
>> way in which processes and events are interleaved, would it be possible
>> to hook into the libprocess message dispatch logic, rather than doing a
>> source code transformation?
>>
>> > Examples of problems to be detected:
>> > * Liveness problems such as deadlock, livelock, starvation.
>> > * Safety problems such as oversubscription of resources, permanent loss of
>> > resources or tasks, and data corruption in general.
>> > * Fairness problems such as sustained imbalance in allocation of resources
>> > to frameworks.
>> > * Performance problems such as high response time, low resource utilization.
>>
>> Validating that the system behaves correctly in the presence of
>> network partitions would also be great.
>>
>> To clarify, it seems like you are primarily focused on finding
>> bugs/problems in core Mesos, rather than in Mesos framework
>> implementations. The latter would also be a very interesting project
>> (e.g., as a framework author, you'd get a tool that pushes your
>> scheduler/executor implementation through the entire state space
>> of situations the framework would need to handle).
>>
>> Neil
>>
>> [1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
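
As a rough illustration of the existing libprocess hooks Neil points to above
(Clock::advance(), Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE()), a
Mesos-style test that pauses virtual time and intercepts or drops messages
might look something like the sketch below. The test name and message names
are illustrative placeholders, not necessarily the exact identifiers used in
the Mesos tree; the real facilities live in process/clock.hpp,
process/gmock.hpp, and process/gtest.hpp.

    #include <gtest/gtest.h>

    #include <process/clock.hpp>
    #include <process/future.hpp>
    #include <process/gmock.hpp>
    #include <process/gtest.hpp>
    #include <process/message.hpp>

    #include <stout/duration.hpp>

    using process::Clock;
    using process::Future;
    using process::Message;

    using testing::_;
    using testing::Eq;

    // The test name and message names below are placeholders for
    // illustration only.
    TEST(LibprocessHooksExample, ControlsTimeAndMessages)
    {
      Clock::pause();  // Stop real time; timers now fire only via Clock::advance().

      // Intercept the next message with this name, from any sender to any receiver.
      Future<Message> registration =
        FUTURE_MESSAGE(Eq("mesos.internal.RegisterSlaveMessage"), _, _);

      // Drop a matching message to simulate message loss or a partition.
      Future<Message> dropped =
        DROP_MESSAGE(Eq("mesos.internal.PongSlaveMessage"), _, _);

      // ... start the master and agent under test here ...

      AWAIT_READY(registration);   // Wait (with a test timeout) for the intercept.

      Clock::advance(Seconds(75)); // Jump virtual time past, e.g., a health-check
      Clock::settle();             // timeout, then let all queued events drain.

      // ... assert on the resulting state here ...

      Clock::resume();
    }

A simulator built on these hooks would already get deterministic control over
timers and message delivery without a source transformation, which is the
direction Neil's question suggests exploring.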