Re: Proposing a deterministic simulation tool for Mesos master and allocator debugging and testing

Khalid Ahmed Mon, 05 Oct 2015 15:29:37 -0700

Great interaction with the community! I notice you use your Gmail. Is it
possible to put in your signature that you work for IBM? We want to get IBM
credit in the community for your work.


Sent from my BlackBerry 10 smartphone.
  Original Message
From: Maged Michael
Sent: Monday, October 5, 2015 6:21 PM
To: dev@mesos.apache.org
Reply To: dev@mesos.apache.org
Subject: Re: Proposing a deterministic simulation tool for Mesos master and
allocator debugging and testing

On Mon, Oct 5, 2015 at 2:41 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
> +1
>
> Likewise, I think it's awesome, would love to be involved.

Thanks. That would be awesome!

> *Marco Massenzio*

--Maged

> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Mon, Oct 5, 2015 at 10:50 AM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> On Sun, Oct 4, 2015 at 6:14 PM, Maged Michael <maged.mich...@gmail.com> wrote:
>> > I'd appreciate feedback on a proposal for a simulation tool for debugging
>> > and testing the Mesos master and allocator.
>>
>> Overall, this is awesome! I'd love to see Mesos improve in this area,
>> and I'd be happy to help out where I can.
>>
>> > Simulations would--randomly but deterministically--explore the state space
>> > of cloud configurations and check for invariant violations and collect
>> > stats--in addition to those already in the Mesos master code.
>>
>> It would be useful to be able to (a) record a "trace" from a running
>> (production) Mesos instance (b) replay that trace under the simulator,
>> e.g., to explore the impact of changes to Mesos. For example, see
>> Section 3.1 of the Borg paper [1].
>>
>> > * Automated transformation of Mesos source code for integration into the
>> > simulator, to allow the simulator to use simulated time instead of real
>> > time and to intercept libprocess-based inter-thread and inter-node
>> > communication.
>>
>> Can you elaborate on how you see the source code transformation working?
>>
>> Because of the way in which Mesos uses processes and message passing,
>> you can already control timeouts and inter-process communication in a
>> fairly sophisticated way -- for example, see Clock::advance(),
>> Clock::settle(), FUTURE_MESSAGE(), DROP_MESSAGE(), etc. Do you think
>> it would be possible to implement the simulator in a way that
>> leverages (and improves!) the existing facilities in libprocess,
>> rather than building new functionality? For example, to control the
>> way in which processes and events are interleaved, would it be
>> possible to do this by hooking into the libprocess message dispatch
>> logic, rather than doing a source code transformation?
>>
>> > Examples of problems to be detected:
>> > * Liveness problems such as deadlock, livelock, starvation
>> > * Safety problems such as oversubscription of resources, permanent loss of
>> > resources or tasks, data corruption in general.
>> > * Fairness problems such as sustained imbalance in allocation of resources
>> > to frameworks.
>> > * Performance problems such as high response time, low resource utilization.
>>
>> Validating that the system behaves correctly in the presence of
>> network partitions would also be great.
>>
>> To clarify, it seems like you are primarily focused on finding
>> bugs/problems in core Mesos, rather than in Mesos framework
>> implementations. The latter would also be a very interesting project
>> (e.g., as a framework author, we'd give you a tool that would push
>> your scheduler/executor implementation through the entire state space
>> of situations the framework would need to handle).
>>
>> Neil
>>
>> [1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
>>
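
For readers of the archive, here is a rough sketch of what "randomly but deterministically" exploring a state space with simulated time could look like. It is only an illustration and is not taken from the proposal or from Mesos: the names SimClock and simulate, the toy model of agents with a fixed number of CPUs, and the single invariant (no oversubscription) are all assumptions made up for this example. A seeded random generator drives every choice, so the same seed always reproduces the same run, and the event queue stands in for real time.

```python
import heapq
import random


class SimClock:
    """Hypothetical simulated clock: time moves only when an event fires."""

    def __init__(self):
        self.now = 0.0
        self._queue = []  # entries are (fire_time, sequence, callback)
        self._seq = 0     # tie-breaker so equal times fire in schedule order

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Pop events in time order; no wall-clock sleeping is involved.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()


def simulate(seed, agents=3, cpus_per_agent=4, tasks=20):
    """Run one seeded simulation of a toy cluster and return its event trace."""
    rng = random.Random(seed)  # every random choice comes from this generator
    clock = SimClock()
    free = [cpus_per_agent] * agents
    trace = []

    def check_invariants():
        # Safety invariant: an agent is never oversubscribed, and never
        # ends up with more free CPUs than it started with.
        assert all(0 <= f <= cpus_per_agent for f in free), (seed, clock.now, free)

    def launch():
        agent = rng.randrange(agents)
        cpus = rng.randint(1, 2)
        if free[agent] >= cpus:
            free[agent] -= cpus
            trace.append((clock.now, "launch", agent, cpus))
            clock.schedule(rng.uniform(1, 10), lambda: finish(agent, cpus))
        else:
            trace.append((clock.now, "decline", agent, cpus))
        check_invariants()

    def finish(agent, cpus):
        free[agent] += cpus
        trace.append((clock.now, "finish", agent, cpus))
        check_invariants()

    for _ in range(tasks):
        clock.schedule(rng.uniform(0, 20), launch)
    clock.run()
    return trace
```

Calling simulate(seed) twice with the same seed returns identical traces, so a seed that trips an invariant can be rerun to reproduce the failure, and looping over many seeds explores different interleavings. A real tool would drive the actual master and allocator code rather than a toy model like this one.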
