Need inputs on running MPI jobs on Mesos

2016-10-13 Thread Mangirish Wagle
Hello Mesos Devs,

I am contributing to Apache Airavata and am currently working on extending
its support for science gateways to run MPI jobs on cloud-based Mesos
clusters.

I am looking at mpiexec-mesos and Mesos Hydra, but I am also interested in
any recent work being done in this area. In general, I would like your
advice on which tool I should use and which direction to take in order to
run MPI jobs on Mesos.

Thank you.

Regards,
Mangirish Wagle
Graduate Student, Indiana University Bloomington.


Re: A Plan for Mesos Community Syncs

2016-10-13 Thread Till Toenshoff
+1 - Thanks MPark!

> On Oct 13, 2016, at 10:34 PM, Michael Park  wrote:
> 
> I would like to try to get the community syncs back on track. They have not
> been organized well recently, and I would like to take ownership of being
> the driver/host of the meetings. I think my half-involved driving of the
> meetings have been detrimental in terms of logistics, consistency,
> experience and ultimately the attendance, and I apologize for this. I also
> think that it's been difficult to rotate between the timezones without
> clear designation of hosts and an inability of many to attend across
> timezones.
> 
> Going forward, we'll have a community sync every other Thursday at 3pm PST,
> starting Oct 20. This is to facilitate getting back to a good cadence with
> consistent hosting (that's on me), with an agenda and regular attendance.
> We can certainly seek for others who can help run the meetings in other
> timezones, but I believe this can come later.
> 
> If you disagree, or opposed to any of what I've said above, please let me
> know.
> 
> Thank you,
> 
> MPark



Re: A Plan for Mesos Community Syncs

2016-10-13 Thread Joris Van Remoortere
+1

—
*Joris Van Remoortere*
Mesosphere

On Thu, Oct 13, 2016 at 1:38 PM, Vinod Kone  wrote:

> Huge +1. Thanks for taking ownership of this.
>
> On Thu, Oct 13, 2016 at 1:34 PM, Michael Park  wrote:
>
> > I would like to try to get the community syncs back on track. They have not
> > been organized well recently, and I would like to take ownership of being
> > the driver/host of the meetings. I think my half-involved driving of the
> > meetings have been detrimental in terms of logistics, consistency,
> > experience and ultimately the attendance, and I apologize for this. I also
> > think that it's been difficult to rotate between the timezones without
> > clear designation of hosts and an inability of many to attend across
> > timezones.
> >
> > Going forward, we'll have a community sync every other Thursday at 3pm PST,
> > starting Oct 20. This is to facilitate getting back to a good cadence with
> > consistent hosting (that's on me), with an agenda and regular attendance.
> > We can certainly seek for others who can help run the meetings in other
> > timezones, but I believe this can come later.
> >
> > If you disagree, or opposed to any of what I've said above, please let me
> > know.
> >
> > Thank you,
> >
> > MPark
> >
>


Re: Parallel test runner added

2016-10-13 Thread Michael Park
Thanks for pushing this through, Benjamin!

I understand if you're unable to attend the community sync on the 20th,
but would you be able to present this as a demo somehow? Maybe via a
screencast?

MPark

On Thu, Oct 13, 2016 at 6:33 PM, Benjamin Mahler  wrote:

> Great to see this Benjamin!
>
> Looking forward to seeing the parallel test runner turn green, I'll help
> file tickets under the epic (I see there are a lot of test failures for
> me).
>
> Once we clear the issues and turn it green, shall we make this the default?
> I would be in favor of that.
>
> Ben
>
> On Thu, Oct 13, 2016 at 2:28 PM, Benjamin Bannier <
> benjamin.bann...@mesosphere.io> wrote:
>
> >
> > Hi,
> >
> > Since most tests in the Mesos, libprocess, and stout test suites can
> > be executed in parallel (the exception being some `ROOT` tests with
> > global side effects in Mesos), we recently added a parallel test
> > runner `support/mesos-gtest-runner.py`. This should allow to
> > potentially significantly speed up running of test suites.
> >
> > To enable automatic parallel execution of tests for test targets
> > executed during `make check`, configure Mesos with the option
> > `--enable-parallel-test-execution`. This will configure the test runner
> > to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
> > be run in a separate, sequential step.
> >
> > * * *
> >
> > We use the environment variable `TEST_DRIVER` to drive parallel test
> > execution. By setting this variable to an empty string you can
> > temporarily disable configured parallel execution, e.g.,
> >
> > % make check TEST_DRIVER=
> >
> > By setting this environment variable you have control over the test
> > runner itself and its arguments, even without enabling parallel test
> > during `./configure` time. Be aware that many `ROOT` tests cannot be
> > run in parallel.
> >
> >
> > The current settings oversubscribe the machine by running `#cores*1.5`
> > parallel jobs. This was driven by the observation that currently our
> > tests by and large do not make extended use of even a single core.
> > The number of parallel jobs can be controlled with the `-j` flag of
> > the test runner.
> >
> > Since making more use of the machine will likely increase machine load
> > during test execution, running tests in parallel might expose test
> > flakiness. Tests might also fail to run in parallel if testcases e.g.,
> > write data to hardcoded locations or use hardcoded ports. Please file
> > JIRA tickets for such tests if they do not yet exist.
> >
> >
> > There is still some work needed to improve reporting from parallel
> > tests. We currently use a very silent mode if tests are running
> > without failures, and just report the logs of failed jobs in case of
> > failure. MESOS-6387 sketches out possible future improvements in this
> > area.
> >
> >
> > Happy testing,
> >
> > Benjamin with help from Kevin & Till
> >
> >
>


Re: Parallel test runner added

2016-10-13 Thread Benjamin Mahler
Great to see this Benjamin!

Looking forward to seeing the parallel test runner turn green. I'll help
file tickets under the epic (I see there are a lot of test failures for me).

Once we clear the issues and turn it green, shall we make this the default?
I would be in favor of that.

Ben

On Thu, Oct 13, 2016 at 2:28 PM, Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

>
> Hi,
>
> Since most tests in the Mesos, libprocess, and stout test suites can
> be executed in parallel (the exception being some `ROOT` tests with
> global side effects in Mesos), we recently added a parallel test
> runner `support/mesos-gtest-runner.py`. This should allow to
> potentially significantly speed up running of test suites.
>
> To enable automatic parallel execution of tests for test targets
> executed during `make check`, configure Mesos with the option
> `--enable-parallel-test-execution`. This will configure the test runner
> to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
> be run in a separate, sequential step.
>
> * * *
>
> We use the environment variable `TEST_DRIVER` to drive parallel test
> execution. By setting this variable to an empty string you can
> temporarily disable configured parallel execution, e.g.,
>
> % make check TEST_DRIVER=
>
> By setting this environment variable you have control over the test
> runner itself and its arguments, even without enabling parallel test
> during `./configure` time. Be aware that many `ROOT` tests cannot be
> run in parallel.
>
>
> The current settings oversubscribe the machine by running `#cores*1.5`
> parallel jobs. This was driven by the observation that currently our
> tests by and large do not make extended use of even a single core.
> The number of parallel jobs can be controlled with the `-j` flag of
> the test runner.
>
> Since making more use of the machine will likely increase machine load
> during test execution, running tests in parallel might expose test
> flakiness. Tests might also fail to run in parallel if testcases e.g.,
> write data to hardcoded locations or use hardcoded ports. Please file
> JIRA tickets for such tests if they do not yet exist.
>
>
> There is still some work needed to improve reporting from parallel
> tests. We currently use a very silent mode if tests are running
> without failures, and just report the logs of failed jobs in case of
> failure. MESOS-6387 sketches out possible future improvements in this
> area.
>
>
> Happy testing,
>
> Benjamin with help from Kevin & Till
>
>


Re: Parallel test runner added

2016-10-13 Thread Alex Rukletsov
This is great, Benjamin!

I've used it the whole day today and it is awesome. (It will become
insanely great once MESOS-6387 is resolved.)

Thanks to everyone who made this happen, also on behalf of my employer : )

Alex.

On Thu, Oct 13, 2016 at 11:28 PM, Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

>
> Hi,
>
> Since most tests in the Mesos, libprocess, and stout test suites can
> be executed in parallel (the exception being some `ROOT` tests with
> global side effects in Mesos), we recently added a parallel test
> runner `support/mesos-gtest-runner.py`. This should allow to
> potentially significantly speed up running of test suites.
>
> To enable automatic parallel execution of tests for test targets
> executed during `make check`, configure Mesos with the option
> `--enable-parallel-test-execution`. This will configure the test runner
> to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
> be run in a separate, sequential step.
>
> * * *
>
> We use the environment variable `TEST_DRIVER` to drive parallel test
> execution. By setting this variable to an empty string you can
> temporarily disable configured parallel execution, e.g.,
>
> % make check TEST_DRIVER=
>
> By setting this environment variable you have control over the test
> runner itself and its arguments, even without enabling parallel test
> during `./configure` time. Be aware that many `ROOT` tests cannot be
> run in parallel.
>
>
> The current settings oversubscribe the machine by running `#cores*1.5`
> parallel jobs. This was driven by the observation that currently our
> tests by and large do not make extended use of even a single core.
> The number of parallel jobs can be controlled with the `-j` flag of
> the test runner.
>
> Since making more use of the machine will likely increase machine load
> during test execution, running tests in parallel might expose test
> flakiness. Tests might also fail to run in parallel if testcases e.g.,
> write data to hardcoded locations or use hardcoded ports. Please file
> JIRA tickets for such tests if they do not yet exist.
>
>
> There is still some work needed to improve reporting from parallel
> tests. We currently use a very silent mode if tests are running
> without failures, and just report the logs of failed jobs in case of
> failure. MESOS-6387 sketches out possible future improvements in this
> area.
>
>
> Happy testing,
>
> Benjamin with help from Kevin & Till
>
>


Parallel test runner added

2016-10-13 Thread Benjamin Bannier

Hi,

Since most tests in the Mesos, libprocess, and stout test suites can
be executed in parallel (the exception being some `ROOT` tests with
global side effects in Mesos), we recently added a parallel test
runner, `support/mesos-gtest-runner.py`. This should make it possible to
run the test suites significantly faster.

To enable automatic parallel execution of tests for test targets
executed during `make check`, configure Mesos with the option
`--enable-parallel-test-execution`. This will configure the test runner
to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
be run in a separate, sequential step.
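
The two-step scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the code in `support/mesos-gtest-runner.py`: the helper names `split_root_tests` and `run_all`, the `run_one` callback, and the assumption that `ROOT` tests can be recognized by a `ROOT_` component in the test name are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_root_tests(test_names):
    """Separate tests whose name contains `ROOT_` from all other tests.

    Matching on the substring `ROOT_` is an assumption made for this sketch.
    """
    root, parallel = [], []
    for name in test_names:
        (root if "ROOT_" in name else parallel).append(name)
    return parallel, root


def run_all(test_names, run_one, jobs=4):
    """Run non-ROOT tests concurrently, then ROOT tests one at a time.

    `run_one` is a caller-supplied callable that takes a test name and
    returns True on success; it stands in for however a single test is
    actually executed.
    """
    parallel, root = split_root_tests(test_names)
    results = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for name, ok in zip(parallel, pool.map(run_one, parallel)):
            results[name] = ok
    for name in root:  # separate, sequential step
        results[name] = run_one(name)
    return results
```

Keeping the `ROOT` tests out of the pool means tests with global side effects never overlap with each other or with the parallel batch.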

* * *

We use the environment variable `TEST_DRIVER` to drive parallel test
execution. By setting this variable to an empty string you can
temporarily disable configured parallel execution, e.g.,

% make check TEST_DRIVER=

By setting this environment variable you have control over the test
runner itself and its arguments, even without enabling parallel test
execution at `./configure` time. Be aware that many `ROOT` tests cannot
be run in parallel.


The current settings oversubscribe the machine by running `#cores*1.5`
parallel jobs. This was driven by the observation that currently our
tests by and large do not make extended use of even a single core.
The number of parallel jobs can be controlled with the `-j` flag of
the test runner.
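
As a worked example of the `#cores*1.5` rule, a job count could be derived as in the Python sketch below. The function name `default_jobs` is hypothetical, and rounding down with a minimum of one job is an assumption; the mail does not say how the runner rounds.

```python
import os


def default_jobs(cores=None, factor=1.5):
    """Return a job count that oversubscribes `cores` by `factor`.

    Rounds down and never returns less than 1 (both are assumptions).
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, int(cores * factor))
```

For a 4-core machine this gives 6 jobs, and for an 8-core machine 12.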

Since making more use of the machine will likely increase machine load
during test execution, running tests in parallel might expose test
flakiness. Tests might also fail to run in parallel if test cases, e.g.,
write data to hardcoded locations or use hardcoded ports. Please file
JIRA tickets for such tests if they do not yet exist.
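
The usual way to avoid the two problems named above is to let the operating system choose ports and directories. The Python sketch below shows the general technique only; the Mesos tests are not written in Python, and the helper names `free_port` and `scratch_dir` are made up for this example.

```python
import os
import socket
import tempfile


def free_port():
    """Bind to port 0 so the OS picks an unused port; return its number.

    The port is released when the socket closes, so another process could
    in principle grab it before the test binds again.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def scratch_dir(prefix="test-"):
    """Create a fresh directory that is unique to this test run."""
    return tempfile.mkdtemp(prefix=prefix)
```

Two tests that each call `scratch_dir()` get different directories, so they cannot overwrite each other's files when run at the same time.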


There is still some work needed to improve reporting from parallel
tests. We currently use a very silent mode if tests are running
without failures, and just report the logs of failed jobs in case of
failure. MESOS-6387 sketches out possible future improvements in this
area.


Happy testing,

Benjamin with help from Kevin & Till



Re: A Plan for Mesos Community Syncs

2016-10-13 Thread Vinod Kone
Huge +1. Thanks for taking ownership of this.

On Thu, Oct 13, 2016 at 1:34 PM, Michael Park  wrote:

> I would like to try to get the community syncs back on track. They have not
> been organized well recently, and I would like to take ownership of being
> the driver/host of the meetings. I think my half-involved driving of the
> meetings have been detrimental in terms of logistics, consistency,
> experience and ultimately the attendance, and I apologize for this. I also
> think that it's been difficult to rotate between the timezones without
> clear designation of hosts and an inability of many to attend across
> timezones.
>
> Going forward, we'll have a community sync every other Thursday at 3pm PST,
> starting Oct 20. This is to facilitate getting back to a good cadence with
> consistent hosting (that's on me), with an agenda and regular attendance.
> We can certainly seek for others who can help run the meetings in other
> timezones, but I believe this can come later.
>
> If you disagree, or opposed to any of what I've said above, please let me
> know.
>
> Thank you,
>
> MPark
>


A Plan for Mesos Community Syncs

2016-10-13 Thread Michael Park
I would like to try to get the community syncs back on track. They have not
been organized well recently, and I would like to take ownership of being
the driver/host of the meetings. I think my half-involved driving of the
meetings has been detrimental in terms of logistics, consistency,
experience and ultimately the attendance, and I apologize for this. I also
think that it's been difficult to rotate between timezones without a
clear designation of hosts, and because many are unable to attend across
timezones.

Going forward, we'll have a community sync every other Thursday at 3pm PST,
starting Oct 20. This is to facilitate getting back to a good cadence with
consistent hosting (that's on me), with an agenda and regular attendance.
We can certainly look for others who can help run the meetings in other
timezones, but I believe this can come later.

If you disagree with, or are opposed to, any of what I've said above, please
let me know.

Thank you,

MPark


Re: On Mesos versioning and deprecation policy

2016-10-13 Thread haosdent
> How about splitting the unnamed version into explicit v0, v2, and internal?

Currently our internal protobufs and the v0 protobufs are the same unnamed
version and live in the same namespace (`package mesos`).
Splitting v0 and internal would require copying all protobuf files under
`package mesos` into `package mesos.internal` and changing the whole code
base to use the protobufs in `package mesos.internal`.
It would still be beneficial, because we could then avoid [the hacks][1]
that convert from the unversioned protobuf (v0) to the unversioned
protobuf (internal).

[1]
https://github.com/apache/mesos/blob/fa976c22ac66ff5c905157a5a36bda1d21525b32/src/master/master.cpp#L4077-L4108

On Thu, Oct 13, 2016 at 12:34 AM, Alex Rukletsov wrote:

> Folks,
>
> There have been a bunch of online [1, 2] and offline discussions about our
> deprecation and versioning policy. I found that people—including
> myself—read the versioning doc [3] differently; moreover some aspects are
> not captured there. I would like to start a discussion around this topic by
> sharing my confusions and suggestions. This will hopefully help us stay on
> the same page and have similar expectations. The second goal is to
> eliminate ambiguities from the versioning doc (thanks Vinod for
> volunteering to update it).
>
> 1. API vs. semantic changes.
> The current versioning guide treats features (e.g. flags, metrics, endpoints)
> and the API differently: incompatible changes for the former are allowed after
> a 6-month deprecation cycle, while for the latter they require bumping the
> major version. I suggest we consolidate these policies.
>
> We should also define and clearly explain what changes require bumping the
> major version. I have no strong opinion here and would love to hear what
> people think. The original motivation for maintaining backwards
> compatibility is to make sure vN schedulers can correctly work with vN API
> without being updated. But what about semantic changes that do not touch
> the API? For example, what if we decide to send fewer task health updates to
> schedulers based on some health policy? It influences the flow of task
> status updates; should such a change be considered compatible? Taking it to
> an extreme, we may not even be able to fix some bugs because someone may
> already rely on this behaviour!
>
> Another tightly related thing we should explicitly call out is
> upgradability and rollback capabilities inside a major release. Committing
> to this may significantly limit what we can change within a major release;
> on the other hand, it will give users more time and a better experience
> using and maintaining Mesos clusters.
>
> 2. Versioned vs. unversioned protobufs.
> Currently we have v1 and unnamed protobufs, which simultaneously mean v0,
> v2, and internal. I am sometimes confused about the right way to update
> or introduce a field or message there. Do people feel the same? How
> about splitting the unnamed version into explicit v0, v2, and internal?
>
> Food for thought. It would be great if we can only maintain "diffs" to the
> internal protobufs in the code, instead of duplicating them altogether.
>
> 3. API and feature labelling.
> I suggest introducing explicit labels for APIs and features, to ensure
> users have the right assumptions about their lifetime while engineers
> have the ability to change a WIP feature in a non-compatible way. I
> propose the following:
> API: stable, non-stable, pure (not used by Mesos components)
> Feature: experimental, normal.
>
> Looking forward to your thoughts and suggestions.
> AlexR
>
> [1] https://www.mail-archive.com/user@mesos.apache.org/msg08025.html
> [2] https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html
> [3] https://github.com/apache/mesos/blob/b2beef37f6f85a8c75e968136caa7a1f292ba20e/docs/versioning.md
>



-- 
Best Regards,
Haosdent Huang


Re: Allowing both CommandInfo and ExecutorInfo on TaskInfo

2016-10-13 Thread haosdent
For a command task, would its `ExecutorInfo` be set to the `CommandExecutor` as
well?

Some tickets may relate to this.

[1]: https://issues.apache.org/jira/browse/MESOS-2330
[2]: https://issues.apache.org/jira/browse/MESOS-527
[3]: https://issues.apache.org/jira/browse/MESOS-5198

On Fri, Oct 14, 2016 at 1:00 AM, Vinod Kone  wrote:

> Hi,
>
> We are contemplating whether to allow both CommandInfo and ExecutorInfo on
> TaskInfo (MESOS-6294).
> Currently we only allow one or the other. The motivation is to allow custom
> executors a more structured way to pass information (e.g, command) about
> Task. Right now custom executors have to get this data via `TaskInfo.bytes`
> which is not ideal.
>
> Are there any custom executors out there that crash if they get Tasks with
> CommandInfo set?
>
> Thoughts?
>
> Vinod
>



-- 
Best Regards,
Haosdent Huang


Allowing both CommandInfo and ExecutorInfo on TaskInfo

2016-10-13 Thread Vinod Kone
Hi,

We are contemplating whether to allow both CommandInfo and ExecutorInfo on
TaskInfo (MESOS-6294).
Currently we only allow one or the other. The motivation is to give custom
executors a more structured way to receive information (e.g., the command)
about a task. Right now custom executors have to get this data via
`TaskInfo.bytes`, which is not ideal.
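
To make the difference between the current rule and the proposal concrete, here is a toy Python sketch. It is not Mesos code: the `TaskInfo` class below models only two fields as plain strings, and the `validate` function with its `allow_both` switch is a hypothetical stand-in for whatever validation would actually change under MESOS-6294.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInfo:
    """Toy stand-in for the protobuf message; only two fields are modelled."""
    command: Optional[str] = None   # stands in for CommandInfo
    executor: Optional[str] = None  # stands in for ExecutorInfo


def validate(task, allow_both=False):
    """Return an error string, or None if the task is acceptable.

    allow_both=False models "one or the other"; allow_both=True models
    the relaxed rule being contemplated (an assumption of this sketch).
    """
    has_command = task.command is not None
    has_executor = task.executor is not None
    if not has_command and not has_executor:
        return "task must set CommandInfo or ExecutorInfo"
    if has_command and has_executor and not allow_both:
        return "task must not set both CommandInfo and ExecutorInfo"
    return None
```

Under the relaxed rule, a task carrying both a command and a custom executor passes validation, which is exactly the case existing custom executors would need to tolerate.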

Are there any custom executors out there that crash if they get Tasks with
CommandInfo set?

Thoughts?

Vinod