Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Neil Conway
On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin wrote:

> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
>

+1.
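
The point above can be illustrated with a toy model. This is only a sketch, not Mesos code: the `Master` class, its fields, and the tuple-shaped messages are hypothetical names chosen for illustration. The only assumptions taken from the thread are that the registry persists agents but not tasks, and that tasks are learned from re-registering agents.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy model of a master that has just failed over (hypothetical)."""

    registry_agents: set                               # agent IDs persisted in the registry
    tasks: dict = field(default_factory=dict)          # agent ID -> task IDs, in memory only
    reregistered: set = field(default_factory=set)     # agents seen since failover

    def reregister(self, agent_id, running_tasks):
        # Tasks are learned only from the agent's own re-registration.
        self.reregistered.add(agent_id)
        self.tasks[agent_id] = list(running_tasks)

    def on_reregister_timeout(self):
        """Return the messages sent when the re-registration timeout fires."""
        messages = []
        for agent_id in sorted(self.registry_agents - self.reregistered):
            # The master has no task list for this agent, so it cannot
            # produce a TASK_LOST per task; it can only report the agent.
            messages.append(("LostSlaveMessage", agent_id))
        return messages


master = Master(registry_agents={"agent-1", "agent-2"})
master.reregister("agent-1", ["task-a"])
messages = master.on_reregister_timeout()
print(messages)
```

In this model `agent-2` never re-registers, so the master knows the agent existed but has no record of what was running on it.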

> The situation where the agent came back after reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such agent is either from status updates or via
> implicit reconciliation. Perhaps, the master could send updates for tasks
> it learned about when such agent is readmitted?
>

I agree this would be a good idea:
https://issues.apache.org/jira/browse/MESOS-6406

I haven't had a chance to implement it yet, but if someone is interested, I
think this would be a pretty nicely scoped project.

Neil


Re: June 3rd: MesosCon North America CFP due!

2017-06-03 Thread Neil Conway
Hi Jay,

The CFP deadline has been extended to June 30.

Neil

On Sat, Jun 3, 2017 at 12:55 AM Jay Guo  wrote:

> According to the link "CFP for MesosCon North America", the CFP closes
> on June 30. Is that a mistake?
>
> CFP Close: June 30, 2017
> CFP Notifications: July 17, 2017
> Schedule Announced: July 19, 2017
>
> cheers,
> - J
>
> On Fri, Jun 2, 2017 at 6:45 AM, Judith Malnick wrote:
>
>> Feel free to resubmit a talk from a previous MesosCon. There are
>> different audiences in each location so don't be shy to submit a talk
>> you've already given or a proposal that you've been optimizing since the
>> last time around.
>>
>> Best!
>> Judith
>>
>>
>> On Wed, May 31, 2017 at 2:57 PM, Judith Malnick wrote:
>>
>>> Hi Apache Mesos Users and Devs,
>>>
>>> This is a friendly reminder that the CFP for MesosCon North America
>>> will close:
>>>
>>> *Saturday, June 3rd*!
>>>
>>> If you've been working on a talk proposal, please put the finishing
>>> touches on it and send it in. The reviewers are really excited to see
>>> everyone's ideas; don't be shy.
>>>
>>> If you have any questions feel free to reach out to me (
>>> jmaln...@mesosphere.com) or Kiersten Gaffney (kiers...@mesosphere.io).
>>>
>>> All the best!
>>> Judith
>>>
>>> --
>>> Judith Malnick
>>> DC/OS Community Manager
>>> 310-709-1517
>>>
>>
>>
>>
>> --
>> Judith Malnick
>> DC/OS Community Manager
>> 310-709-1517
>>
>
>


Re: RFC: Partition Awareness

2017-06-01 Thread Neil Conway
Hi Ben,

The argument for changing the semantics is that correct frameworks
should _always_ have accounted for the possibility that TASK_LOST
tasks would go back to running (due to the non-strict registry
semantics). The proposed change would just increase the probability of
this behavior occurring. From a certain POV, this change would
actually make it easier to write correct frameworks because the
TASK_LOST scenario will be less of a corner case :)
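
As a rough sketch of what "accounting for this possibility" might look like on the scheduler side: the `TaskTracker` class and its replacement policy below are hypothetical, not part of any Mesos API. The only assumption taken from the discussion is that a task reported as TASK_LOST may later be reported as TASK_RUNNING.

```python
class TaskTracker:
    """Scheduler-side view of task state that never treats TASK_LOST as final."""

    def __init__(self):
        self.state = {}        # task ID -> last reported state
        self.replacement = {}  # lost task ID -> ID of the task launched to replace it

    def launch_replacement(self, lost_id, new_id):
        # Remember which task was started to stand in for a lost one.
        self.replacement[lost_id] = new_id
        self.state[new_id] = "TASK_STAGING"

    def update(self, task_id, new_state):
        """Record a status update; return task IDs the scheduler may want to kill."""
        old_state = self.state.get(task_id)
        self.state[task_id] = new_state
        if old_state == "TASK_LOST" and new_state == "TASK_RUNNING":
            # The "lost" task came back, so any replacement is now a duplicate.
            duplicate = self.replacement.pop(task_id, None)
            return [duplicate] if duplicate is not None else []
        return []


tracker = TaskTracker()
tracker.update("web-1", "TASK_RUNNING")
tracker.update("web-1", "TASK_LOST")
tracker.launch_replacement("web-1", "web-1-retry")
to_kill = tracker.update("web-1", "TASK_RUNNING")
print(to_kill)
```

Whether to kill the replacement or the resurrected task is a framework policy decision; the sketch only shows that the transition has to be handled rather than rejected.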

Implementing the task-killing behavior is a bit tricky, because the
task might continue to run on the agent for a considerable period of
time. During that time, we can either:

(a) omit the being-killed task from the master's memory (current
behavior). That means that any resources used by the task appear to be
unused, so there might be a concurrent task launch that attempts to
use them and fails.

(b) track the being-killed task in the master's memory. This ensures
the task's resources are not re-offered until the task is actually
terminated. The concern here is that this "being-killed" task is in a
weird state -- what task status should it have? When it finally dies,
we don't want to report a terminal status update back to frameworks
(for backward compatibility).

Neither of those approaches seemed ideal, hence we are wondering
whether we really need to implement this backward compatibility
behavior in the first place.
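
To make the difference between (a) and (b) concrete, here is a small sketch of the resource accounting on a single agent. The function name, the `"being_killed"` label, and the CPU numbers are all hypothetical; this is not how the master's allocator is implemented.

```python
def available_cpus(total, tasks, track_being_killed):
    """CPUs the master believes are free on one agent.

    tasks: list of (cpus, status) pairs, where status is "running" or
    "being_killed" (a hypothetical label for a task whose kill was requested
    but which has not yet terminated on the agent).
    """
    used = 0.0
    for cpus, status in tasks:
        if status == "being_killed" and not track_being_killed:
            continue  # option (a): the master has forgotten this task
        used += cpus
    return total - used


tasks = [(2.0, "running"), (4.0, "being_killed")]
option_a = available_cpus(8.0, tasks, track_being_killed=False)
option_b = available_cpus(8.0, tasks, track_being_killed=True)
print(option_a, option_b)
```

Under (a) the master would believe 6 CPUs are free while the agent really has 2, so a 5-CPU launch could be offered and then fail; under (b) the accounting matches the agent, at the cost of tracking a task in an awkward state.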

Neil

On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:
> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>
> It seems like we should just directly fix the issue, do you have a sense of
> what the difficulty is there? Is it the re-use of the existing framework
> shutdown message to kill the tasks that makes this problematic?
>
> On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:
>>
>> Hi All,
>>
>> We are working on fixing a potential issue (MESOS-7215) with partition
>> awareness, which occurs when an unreachable agent running tasks for
>> non-partition-aware frameworks attempts to re-register with the master.
>> Before support for partition-aware frameworks was introduced in Mesos
>> 1.1.0 (MESOS-5344), an agent that had been partitioned from the master
>> and then attempted to re-register would be shut down, and all the tasks
>> on the agent would be terminated. With this feature, partitioned agents
>> are no longer shut down by the master when they re-register, but to keep
>> the old behavior, the tasks on these agents are still shut down if the
>> corresponding framework didn’t opt in to partition awareness.
>>
>> One possible solution to the issue described in MESOS-7215 is to change
>> the master’s behavior so that it does not kill the tasks of
>> non-partition-aware frameworks when an unreachable agent re-registers.
>> When an agent becomes unreachable, i.e. fails the master’s health check
>> ping max_agent_ping_timeouts times, the master sends TASK_LOST status
>> updates for all the tasks on this agent that were launched by
>> non-partition-aware frameworks. So, if such tasks are no longer killed
>> by the master, then upon agent re-registration the frameworks will see
>> non-terminal status updates for tasks for which they already received
>> TASK_LOST.
>> This change will hopefully not break any schedulers, since the same
>> thing could already happen with the non-strict registry, and schedulers
>> are expected to be resilient enough to handle this scenario.
>>
>> We wanted to get feedback from the community on the proposed solution to
>> ensure that this change doesn’t break schedulers or cause any side
>> effects for them. Looking forward to any feedback/comments.
>>
>> Many Thanks
>> Megha
>>
>>
>


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-31 Thread Neil Conway
On Tue, May 30, 2017 at 3:43 PM, Neil Conway <neil.con...@gmail.com> wrote:
> Attached is the test log for this failure. From a quick look, seems as
> though the agent starts to launch the task, including forking the
> child process, but no subsequent task status updates or error messages
> are observed. Gaston, have you seen this before?
>
> I filed https://issues.apache.org/jira/browse/MESOS-7589 to track this.

I wasn't able to repro this failure. Per Gaston's email, there isn't
enough information in the logs to understand what is going on here,
although it certainly seems weird that apparently the executor doesn't
start.

I think this doesn't justify blocking the release, but we should watch
to see if the problem recurs.

Neil


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-31 Thread Neil Conway
On Tue, May 30, 2017 at 2:36 PM, Vinod Kone  wrote:
> Failed test: OneWayPartitionTest.MasterToSlave
> 

I was able to reproduce this locally, and confirmed that it is a flaky
test. Fix here:

https://reviews.apache.org/r/59685/

I don't think this requires spinning a new 1.3.0 RC.

Neil


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-30 Thread Neil Conway
On Tue, May 30, 2017 at 2:36 PM, Vinod Kone  wrote:
> Ran on ASF CI.
>
> Found following issues.
>
> Failed test: CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
> 

Attached is the test log for this failure. From a quick look, seems as
though the agent starts to launch the task, including forking the
child process, but no subsequent task status updates or error messages
are observed. Gaston, have you seen this before?

I filed https://issues.apache.org/jira/browse/MESOS-7589 to track this.

> Failed test: OneWayPartitionTest.MasterToSlave
> 

Looking into this now.

Neil
[ RUN  ] CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
I0525 16:55:27.473031  2250 cluster.cpp:162] Creating default 'local' authorizer
I0525 16:55:27.475637  2280 master.cpp:436] Master b97c30ee-bdab-4879-ba55-5d32f822c038 (305d67e5598a) started on 172.17.0.2:40622
I0525 16:55:27.475666  2280 master.cpp:438] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/UqWhgS/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/mesos/mesos-1.3.0/_inst/share/mesos/webui" --work_dir="/tmp/UqWhgS/master" --zk_session_timeout="10secs"
I0525 16:55:27.475993  2280 master.cpp:488] Master only allowing authenticated frameworks to register
I0525 16:55:27.476006  2280 master.cpp:502] Master only allowing authenticated agents to register
I0525 16:55:27.476013  2280 master.cpp:515] Master only allowing authenticated HTTP frameworks to register
I0525 16:55:27.476022  2280 credentials.hpp:37] Loading credentials for authentication from '/tmp/UqWhgS/credentials'
I0525 16:55:27.476297  2280 master.cpp:560] Using default 'crammd5' authenticator
I0525 16:55:27.476441  2280 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
I0525 16:55:27.476702  2280 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
I0525 16:55:27.476845  2280 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
I0525 16:55:27.476972  2280 master.cpp:640] Authorization enabled
I0525 16:55:27.477180  2272 whitelist_watcher.cpp:77] No whitelist given
I0525 16:55:27.477191  2270 hierarchical.cpp:158] Initialized hierarchical allocator process
I0525 16:55:27.480708  2284 master.cpp:2161] Elected as the leading master!
I0525 16:55:27.480739  2284 master.cpp:1700] Recovering from registrar
I0525 16:55:27.480864  2285 registrar.cpp:345] Recovering registrar
I0525 16:55:27.481547  2285 registrar.cpp:389] Successfully fetched the registry (0B) in 645888ns
I0525 16:55:27.481663  2285 registrar.cpp:493] Applied 1 operations in 22833ns; attempting to update the registry
I0525 16:55:27.482271  2285 registrar.cpp:550] Successfully updated the registry in 556032ns
I0525 16:55:27.482409  2285 registrar.cpp:422] Successfully recovered registrar
I0525 16:55:27.482946  2277 hierarchical.cpp:185] Skipping recovery of hierarchical allocator: nothing to recover
I0525 16:55:27.482945  2286 master.cpp:1799] Recovered 0 agents from the registry (129B); allowing 10mins for agents to re-register
I0525 16:55:27.487871  2250 containerizer.cpp:221] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
W0525 16:55:27.488440  2250 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges

Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Neil Conway
On Tue, May 30, 2017 at 12:58 PM, Michael Park  wrote:
> I'm all for moving to GCC 4.9+.
>
> I'd love to get C++14 and bump to GCC 5, but I think we should do an
> investigation for "reasonable availability" before we do that.

I agree, although I'd think a similar investigation is required to
move to GCC 4.9+.

To rephrase my previous point, I'd expect that most platforms fall
into two groups: either the system compiler is ancient, in which case
something like devtoolset is required anyway, or the system compiler
is relatively modern, in which case there is no difference between
depending on GCC >= 4.9 vs. GCC >= 5.

Neil


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Neil Conway
It seems that if we moved to GCC 5, we'd also be able to move to C++14
(https://gcc.gnu.org/projects/cxx-status.html#cxx14).

CentOS 6 users will need to install devtoolset anyway (which makes it
easy to get GCC 5 or 6), so I wonder if skipping directly to requiring
GCC 5 would be feasible?

Neil


On Tue, May 30, 2017 at 11:17 AM, Jacob Janco  wrote:
> Along with various additions and optimizations, support for  would be 
> nice to have. Thoughts on this?


Re: [VOTE] Release Apache Mesos 1.3.0 (rc2)

2017-05-24 Thread Neil Conway
The vote has failed; we'll cut a new release shortly.

The release blocker (MESOS-7521) has been investigated and fixed. The
next RC will also include MESOS-7538, as well as the `register_agents`
ACL change mentioned in a different thread.

Neil

On Wed, May 17, 2017 at 3:11 PM, Yan Xu  wrote:
> -1 (binding)
>
> Let's address this blocker first. Neil's looking into it now.
>
> Yan


Re: Welcome Gilbert Song as a new committer and PMC member!

2017-05-24 Thread Neil Conway
Congratulations Gilbert! Well-deserved!

Neil

On Wed, May 24, 2017 at 10:32 AM, Jie Yu  wrote:
> Hi folks,
>
> I'm happy to announce that the PMC has voted Gilbert Song as a new committer
> and member of PMC for the Apache Mesos project. Please join me to
> congratulate him!
>
> Gilbert has been working on the Mesos project for 1.5 years now. His main
> contribution is his work on the unified containerizer and nested container
> (aka Pod) support. He has also helped a lot of folks in the community with
> their patches and questions, and he played an important role in organizing
> MesosCon Asia last year and this year!
>
> His formal committer checklist can be found here:
> https://docs.google.com/document/d/1iSiqmtdX_0CU-YgpViA6r6PU_aMCVuxuNUZ458FR7Qw/edit?usp=sharing
>
> Welcome, Gilbert!
>
> - Jie


Re: Use of ACLs.RegisterAgent.agent

2017-05-24 Thread Neil Conway
FYI, I merged the change to rename this field into the master and
1.3.x branches; it will be included in the next 1.3.0 release
candidate.

Neil


On Mon, May 22, 2017 at 10:43 AM, Alexander Rojas wrote:
> Hey guys,
>
> We just noted that there was an error when the `RegisterAgent` ACL was
> introduced. Namely, its object field is named `agent`, whereas by convention
> we use the plural, so it should be `agents`. This ACL hasn’t been part of
> any released version of Mesos, so if no one is using it I will try to push
> for a rename without going through a deprecation cycle.
>
> The big question is if any of you are using this particular ACL in
> production right now?
>
> Alexander Rojas
> alexan...@mesosphere.io
>
>
>
>


Re: mesos git commit: Updated the outdated network isolator configure flag.

2017-05-18 Thread Neil Conway
This commit enables the port mapping isolator by default. Was that
intended? Among other things, it breaks the build on OSX:

$ ../mesos/configure --disable-java --disable-python
[...]
configure: error: cannot build network isolator
---
Network isolator is only supported on Linux!
---

Neil

On Thu, May 18, 2017 at 8:34 AM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 8c564db51 -> 20dee4190
>
>
> Updated the outdated network isolator configure flag.
>
> This patch updated the outdated network isolator configure flag to
> more descriptive port mapping isolator.
>
> Review: https://reviews.apache.org/r/59193/
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/20dee419
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/20dee419
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/20dee419
>
> Branch: refs/heads/master
> Commit: 20dee4190838b5bc7eb9cb524af413e3fe3fe082
> Parents: 8c564db
> Author: Tim Hansen 
> Authored: Wed May 17 14:47:59 2017 -0700
> Committer: Jie Yu 
> Committed: Thu May 18 08:33:54 2017 -0700
>
> --
>  configure.ac                                    | 18 +++---
>  src/Makefile.am                                 |  6 +++---
>  src/master/flags.cpp                            |  4 ++--
>  src/master/flags.hpp                            |  4 ++--
>  src/master/master.cpp                           |  4 ++--
>  src/slave/constants.hpp                         |  2 +-
>  src/slave/containerizer/mesos/containerizer.cpp |  4 ++--
>  src/slave/flags.cpp                             |  4 ++--
>  src/slave/flags.hpp                             |  2 +-
>  src/tests/environment.cpp                       |  8
>  src/tests/master_tests.cpp                      |  4 ++--
>  src/tests/mesos.cpp                             |  6 +++---
>  12 files changed, 35 insertions(+), 31 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/20dee419/configure.ac
> --
> diff --git a/configure.ac b/configure.ac
> index 8c17307..d523670 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -187,6 +187,11 @@ AC_ARG_ENABLE([debug],
>option won't change them]),
>[], [enable_debug=no])
>
> +AC_ARG_ENABLE([port-mapping-isolator],
> +  AS_HELP_STRING([--enable-port-mapping-isolator],
> + [enable port mapping isolator]),
> +  [], [enable_port_mapping_isolator=yes])
> +
>  AC_ARG_ENABLE([java],
>AS_HELP_STRING([--disable-java],
>   [don't build Java bindings]),
> @@ -334,12 +339,11 @@ AC_ARG_WITH([libprocess],
> [specify where to locate the libprocess library]),
>  [], [])
>
> -# TODO(MESOS-4991): Since network-isolator is an optional feature, it should
> -# be enabled with --enable-network-isolator.
>  AC_ARG_WITH([network-isolator],
>  AS_HELP_STRING([--with-network-isolator],
> [builds the network isolator]),
> -[], [with_network_isolator=no])
> +[AC_MSG_WARN(["--with-network-isolator is being depreciated, 
> please use --enable-port-mapping-isolator instead."])],
> +[enable_port_mapping_isolator=yes])
>
>  AC_ARG_WITH([nl],
>  AS_HELP_STRING([--with-nl=@<:@DIR@:>@],
> @@ -1270,7 +1274,7 @@ AM_CONDITIONAL([WITH_BUNDLED_LIBPROCESS], [test 
> "x$with_bundled_libprocess" = "x
>
>
>  # Perform necessary configuration for network isolator.
> -if test "x$with_network_isolator" = "xyes"; then
> +if test "x$enable_port_mapping_isolator" = "xyes"; then
>if test -n "`echo $with_nl`"; then
>  CPPFLAGS="-I${with_nl}/include/libnl3 $CPPFLAGS"
>  LDFLAGS="-L${with_nl}/lib $LDFLAGS"
> @@ -1342,11 +1346,11 @@ https://github.com/thom311/libnl/releases
>  ---
>])])
>
> -  AC_DEFINE([WITH_NETWORK_ISOLATOR])
> +  AC_DEFINE([ENABLE_PORT_MAPPING_ISOLATOR])
>  fi
>
> -AM_CONDITIONAL([WITH_NETWORK_ISOLATOR],
> -   [test "x$with_network_isolator" = "xyes"])
> +AM_CONDITIONAL([ENABLE_PORT_MAPPING_ISOLATOR],
> +   [test "x$enable_port_mapping_isolator" = "xyes"])
>
>
>  # If the user has asked not to include the bundled NVML headers for
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/20dee419/src/Makefile.am
> --
> diff --git a/src/Makefile.am b/src/Makefile.am
> index 

Re: [VOTE] Release Apache Mesos 1.3.0 (rc1)

2017-05-08 Thread Neil Conway
Personally, I'm not convinced that we need to fix MESOS-7378. The
problem is essentially a bug in glibc that was fixed 6 years ago. (As
a point of reference, the oldest version of g++ we support was
released 2 years ago... :) )

Neil

On Mon, May 8, 2017 at 3:45 PM, Yan Xu  wrote:
> I am still hoping that we get
> https://issues.apache.org/jira/browse/MESOS-7378 fixed before shipping
> 1.3.0. :)
>
> ---
> Jiang Yan Xu  | @xujyan
>
> On Fri, May 5, 2017 at 6:31 PM, Michael Park  wrote:
>>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.3.0.
>>
>>
>> 1.3.0 includes the following:
>>
>> 
>>   - Multi-role framework support
>>   - Executor authentication support
>>   - Allow frameworks to modify their roles.
>>   - Hierarchical roles (*EXPERIMENTAL*)
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.3.0-rc1
>>
>> 
>>
>> The candidate for Mesos 1.3.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc1/mesos-1.3.0.tar.gz
>>
>> The tag to be voted on is 1.3.0-rc1:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.3.0-rc1
>>
>> The MD5 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc1/mesos-1.3.0.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc1/mesos-1.3.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1190
>>
>> Please vote on releasing this package as Apache Mesos 1.3.0!
>>
>> The vote is open until Wed May 10 11:59:59 PDT 2017 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>>
>> MPark & Neil
>
>


Re: Version numbers during development

2017-05-08 Thread Neil Conway
Per offline discussion with mpark, he suggested a preference for
"-dev" rather than "-devel", which sounds fine to me.

Based on the lack of objections, I'll plan on making this change
shortly: (1) changing the version number in master to "1.4.0-dev", and
(2) updating the release process documentation accordingly.

If you have any concerns, please let me know.

Neil


On Fri, May 5, 2017 at 4:29 PM, Neil Conway <neil.con...@gmail.com> wrote:
> In my experience, this is reasonably common. For example, Postgres
> uses version numbers like "9.5devel" to identify the software in the
> development period before the next release process starts.
>
> Neil
>
> On Fri, May 5, 2017 at 4:23 PM, Vinod Kone <vinodk...@apache.org> wrote:
>> Is this a common practice that autotools based projects follow?
>>
>> For publishing snapshot JARs to Maven, for example, we just add "-SNAPSHOT"
>> to the version tag (and hence to the JAR name) before publishing, without
>> needing to change the version in source control.
>>
>> On Fri, May 5, 2017 at 1:27 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Sent from my iPhone
>>>
>>> > On May 5, 2017, at 12:56 PM, Neil Conway <neil.con...@gmail.com> wrote:
>>> >
>>> > Our current practice is that when we create a branch for version X, we
>>> > bump the version number in the "master" branch to X+1. For example, we
>>> > just created the 1.3.x branch, and bumped the version number in master
>>> > to "1.4.0".
>>> >
>>> > Proposal: we should instead use a version number like "1.4.0-devel" in
>>> > the master branch. When the 1.4.x release branch is created, the first
>>> > commit in that branch would switch to use the "1.4.0" version number.
>>> > Meanwhile, master would be bumped to use "1.5.0-devel".
>>> >
>>> > The main benefit is to make it easier to distinguish official Mesos
>>> > releases from snapshots that are taken from the master branch at some
>>> > point during development. Note that according to SemVer, "1.4.0-devel"
>>> > is considered to be "older" than "1.4.0", which is the behavior we'd
>>> > want.
>>> >
>>> > Neil
>>>


Re: [4/6] mesos git commit: Checked validity of master and agent version numbers on startup.

2017-05-08 Thread Neil Conway
There was an earlier thread on whether to ignore registration attempts
from pre-1.0 Mesos agents:

https://lists.apache.org/thread.html/f5e15f6d7a3f3b08d29e27455e2e1c801775418a148dded953c568e7@%3Cdev.mesos.apache.org%3E

We didn't explicitly discuss what to do when the compiled-in version
of Mesos is not parseable as SemVer. I'll start a separate thread to
let users know about that change.
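
For anyone building from source with a custom version string, the following sketch may help predict whether the new startup check would reject it. It validates against plain SemVer 2.0.0 only; stout's `Version::parse` accepts an extended variant, so this is an approximation, and the `check_version` helper is a hypothetical name, not Mesos code.

```python
import re

# Building blocks of the SemVer 2.0.0 grammar.
_NUM = r"(?:0|[1-9][0-9]*)"                              # numeric identifier, no leading zeros
_PRE = r"(?:0|[1-9][0-9]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)"  # pre-release identifier

SEMVER = re.compile(
    rf"{_NUM}\.{_NUM}\.{_NUM}"                    # major.minor.patch
    rf"(?:-{_PRE}(?:\.{_PRE})*)?"                 # optional pre-release label
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"   # optional build metadata
)


def check_version(version):
    """Return an error string, or None if `version` is valid SemVer 2.0.0."""
    if SEMVER.fullmatch(version) is None:
        return f"Failed to parse Mesos version '{version}'"
    return None


for candidate in ["1.3.0", "1.4.0-dev", "1.3", "1.3.0-custom_build"]:
    print(candidate, "->", check_version(candidate))
```

In this sketch, a two-component version such as "1.3" and a label containing an underscore are rejected, while a hyphenated pre-release label such as "1.4.0-dev" is accepted.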

Neil


On Mon, May 8, 2017 at 10:25 AM, Vinod Kone <vinodk...@apache.org> wrote:
> @Neil: Have we sent an email about this change to dev list? This might
> break people who were directly building off source and using a custom
> version number.
>
> On Fri, May 5, 2017 at 4:54 PM, <ne...@apache.org> wrote:
>
>> Checked validity of master and agent version numbers on startup.
>>
>> During startup, we now check that the compiled-in version number of the
>> master and agent can be parsed by stout's `Version::parse` (i.e., that
>> `MESOS_VERSION` is valid according to stout's extended variant of the
>> Semver 2.0.0 format).
>>
>> Review: https://reviews.apache.org/r/58708
>>
>>
>> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/5a5dd8a4
>> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/5a5dd8a4
>> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/5a5dd8a4
>>
>> Branch: refs/heads/1.2.x
>> Commit: 5a5dd8a4edcb52e2227e2a4607b95a7dcc6aa321
>> Parents: c56851e
>> Author: Neil Conway <neil.con...@gmail.com>
>> Authored: Mon Mar 6 10:55:07 2017 -0800
>> Committer: Neil Conway <neil.con...@gmail.com>
>> Committed: Fri May 5 15:16:38 2017 -0700
>>
>> --
>>  src/master/main.cpp | 11 +++
>>  src/slave/main.cpp  | 11 +++
>>  2 files changed, 22 insertions(+)
>> --
>>
>>
>> http://git-wip-us.apache.org/repos/asf/mesos/blob/5a5dd8a4/
>> src/master/main.cpp
>> --
>> diff --git a/src/master/main.cpp b/src/master/main.cpp
>> index da75fe9..d485a06 100644
>> --- a/src/master/main.cpp
>> +++ b/src/master/main.cpp
>> @@ -57,6 +57,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include "common/build.hpp"
>>  #include "common/http.hpp"
>> @@ -175,6 +176,16 @@ int main(int argc, char** argv)
>>  return EXIT_FAILURE;
>>}
>>
>> +  // Check that master's version has the expected format (SemVer).
>> +  {
>> +Try<Version> version = Version::parse(MESOS_VERSION);
>> +if (version.isError()) {
>> +  EXIT(EXIT_FAILURE)
>> +<< "Failed to parse Mesos version '" << MESOS_VERSION << "': "
>> +<< version.error();
>> +}
>> +  }
>> +
>>if (flags.ip_discovery_command.isSome() && flags.ip.isSome()) {
>>  EXIT(EXIT_FAILURE) << flags.usage(
>>  "Only one of `--ip` or `--ip_discovery_command` should be
>> specified");
>>
>> http://git-wip-us.apache.org/repos/asf/mesos/blob/5a5dd8a4/
>> src/slave/main.cpp
>> --
>> diff --git a/src/slave/main.cpp b/src/slave/main.cpp
>> index 31f2b4f..f90aa2f 100644
>> --- a/src/slave/main.cpp
>> +++ b/src/slave/main.cpp
>> @@ -39,6 +39,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include "common/build.hpp"
>>  #include "common/http.hpp"
>> @@ -142,6 +143,16 @@ int main(int argc, char** argv)
>>  return EXIT_FAILURE;
>>}
>>
>> +  // Check that agent's version has the expected format (SemVer).
>> +  {
>> +Try<Version> version = Version::parse(MESOS_VERSION);
>> +if (version.isError()) {
>> +  EXIT(EXIT_FAILURE)
>> +<< "Failed to parse Mesos version '" << MESOS_VERSION << "': "
>> +<< version.error();
>> +}
>> +  }
>> +
>>if (flags.master.isNone() && flags.master_detector.isNone()) {
>>  cerr << flags.usage("Missing required option `--master` or "
>>  "`--master_detector`.") << endl;
>>
>>


Re: documenting test expectations

2017-05-08 Thread Neil Conway
My two cents:

(1) I almost always need to read the test to understand/debug a test failure.

(2) As long as the intent of the test is clear, I'm not picky about
whether clarifying comments take the form of C++ comments or
explanatory EXPECT messages.

One thing I would be opposed to is adding explanatory EXPECT messages
in an indiscriminate way; IMO that just adds redundancy and violates
DRY (most test expectations are pretty self-explanatory).

Neil

On Mon, May 8, 2017 at 8:24 AM, James Peach  wrote:
>
>> On May 1, 2017, at 4:28 PM, Benjamin Mahler  wrote:
>>
>> Do you have some examples?
>
> I think that this:
>
> EXPECT_EQ(Bytes(512u), BasicBlocks(Bytes(128)).bytes())
> << "a partial block should round up";
>
> is a strict superset of this:
>
> // A partial block should round up.
> EXPECT_EQ(Bytes(512u), BasicBlocks(Bytes(128)).bytes())
>
> The former is preferable since the person triaging test failures gets the 
> immediate context of what the expectation is doing. This is valuable even if 
> you might also find you need to check the source.
>
>>
>> Thinking through my own experience debugging tests, I tend to only get
>> value out of EXPECT messages when they are providing information that I
>> can't get access to from the line number / actual vs expected printing.
>> (e.g. the value of a variable). If the EXPECT message is simply explaining
>> what the test is doing, then I tend to ignore it and read the test instead,
>> so it would be helpful to discuss some examples to get a better sense. :)
>>
>> On Sat, Apr 29, 2017 at 10:02 AM, James Peach  wrote:
>>
>>> Hi all,
>>>
>>> In a couple of reviews, I've been asked to avoid emitting explanatory
>>> messages from the EXPECT() macro. The rationale for this is that tests
>>> usually use comments. However, I think that emitting the reason for a
>>> failed expectation into the test log is pretty helpful and we should do it
>>> more often.
>>>
>>> What do people think about explicitly allowing (or even encouraging) this?
>>> i.e. EXPECT(...) << "some explanation goes here"
>>>
>>> J
>


Re: Version numbers during development

2017-05-05 Thread Neil Conway
In my experience, this is reasonably common. For example, Postgres
uses version numbers like "9.5devel" to identify the software in the
development period before the next release process starts.

Neil

On Fri, May 5, 2017 at 4:23 PM, Vinod Kone <vinodk...@apache.org> wrote:
> Is this a common practice that autotools based projects follow?
>
> For publishing snapshot JARs to Maven, for example, we just add "-SNAPSHOT"
> to the version tag (and hence to the JAR name) before publishing, without
> needing to change the version in source control.
>
> On Fri, May 5, 2017 at 1:27 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> +1
>>
>> Sent from my iPhone
>>
>> > On May 5, 2017, at 12:56 PM, Neil Conway <neil.con...@gmail.com> wrote:
>> >
>> > Our current practice is that when we create a branch for version X, we
>> > bump the version number in the "master" branch to X+1. For example, we
>> > just created the 1.3.x branch, and bumped the version number in master
>> > to "1.4.0".
>> >
>> > Proposal: we should instead use a version number like "1.4.0-devel" in
>> > the master branch. When the 1.4.x release branch is created, the first
>> > commit in that branch would switch to use the "1.4.0" version number.
>> > Meanwhile, master would be bumped to use "1.5.0-devel".
>> >
>> > The main benefit is to make it easier to distinguish official Mesos
>> > releases from snapshots that are taken from the master branch at some
>> > point during development. Note that according to SemVer, "1.4.0-devel"
>> > is considered to be "older" than "1.4.0", which is the behavior we'd
>> > want.
>> >
>> > Neil
>>


Version numbers during development

2017-05-05 Thread Neil Conway
Our current practice is that when we create a branch for version X, we
bump the version number in the "master" branch to X+1. For example, we
just created the 1.3.x branch, and bumped the version number in master
to "1.4.0".

Proposal: we should instead use a version number like "1.4.0-devel" in
the master branch. When the 1.4.x release branch is created, the first
commit in that branch would switch to use the "1.4.0" version number.
Meanwhile, master would be bumped to use "1.5.0-devel".

The main benefit is to make it easier to distinguish official Mesos
releases from snapshots that are taken from the master branch at some
point during development. Note that according to SemVer, "1.4.0-devel"
is considered to be "older" than "1.4.0", which is the behavior we'd
want.
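
As a rough illustration of the SemVer precedence rule mentioned above, here is a minimal sketch. The `Version` type and the `parseVersion` and `precedes` helpers are hypothetical names made up for this example, not Mesos code, and the parser assumes well-formed input of the form "X.Y.Z" or "X.Y.Z-prerelease".

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical type for illustration only; this is not Mesos code.
struct Version {
  int majorVersion = 0;
  int minorVersion = 0;
  int patchVersion = 0;
  std::vector<std::string> prerelease;  // Empty for a release like "1.4.0".
};

// Parses "X.Y.Z" or "X.Y.Z-prerelease". Build metadata ("+...") and
// validation of malformed input are omitted to keep the sketch short.
Version parseVersion(const std::string& text) {
  Version version;
  const std::size_t dash = text.find('-');
  std::istringstream core(text.substr(0, dash));
  char dot = '.';
  core >> version.majorVersion >> dot >> version.minorVersion >> dot
       >> version.patchVersion;
  if (dash != std::string::npos) {
    std::istringstream pre(text.substr(dash + 1));
    std::string identifier;
    while (std::getline(pre, identifier, '.')) {
      version.prerelease.push_back(identifier);
    }
  }
  return version;
}

bool isNumeric(const std::string& identifier) {
  return !identifier.empty() &&
         std::all_of(identifier.begin(), identifier.end(),
                     [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Returns a negative, zero, or positive value for one pair of
// pre-release identifiers.
int compareIdentifiers(const std::string& a, const std::string& b) {
  const bool numericA = isNumeric(a);
  const bool numericB = isNumeric(b);
  if (numericA && numericB) {
    const unsigned long long x = std::stoull(a);
    const unsigned long long y = std::stoull(b);
    return x < y ? -1 : (x > y ? 1 : 0);
  }
  if (numericA != numericB) {
    return numericA ? -1 : 1;  // Numeric identifiers sort first.
  }
  return a.compare(b);
}

// True if `a` has lower precedence ("is older") than `b`.
bool precedes(const Version& a, const Version& b) {
  const auto coreA = std::tie(a.majorVersion, a.minorVersion, a.patchVersion);
  const auto coreB = std::tie(b.majorVersion, b.minorVersion, b.patchVersion);
  if (coreA != coreB) {
    return coreA < coreB;
  }
  // Same X.Y.Z: a pre-release sorts before the corresponding release.
  if (a.prerelease.empty() || b.prerelease.empty()) {
    return !a.prerelease.empty() && b.prerelease.empty();
  }
  const std::size_t n = std::min(a.prerelease.size(), b.prerelease.size());
  for (std::size_t i = 0; i < n; ++i) {
    const int c = compareIdentifiers(a.prerelease[i], b.prerelease[i]);
    if (c != 0) {
      return c < 0;
    }
  }
  return a.prerelease.size() < b.prerelease.size();
}
```

With these helpers, `precedes(parseVersion("1.4.0-devel"), parseVersion("1.4.0"))` is true, while "1.3.1" still sorts before "1.4.0-devel".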

Neil


Re: [Design doc] RPC: Fault domains in Mesos

2017-04-19 Thread Neil Conway
Hi Maxime,

Thanks for the feedback!

The proposed approach is definitely simplistic. The "Discussion"
section of the design doc describes some of the rationale for starting
with a very simple scheme: basically, because

(a) we want to assign clear semantics to the levels of the hierarchy
(regions are far away from each other and inter-region network links
have high latency; racks are close together and inter-rack network
links have low latency).

(b) we don't want to make life too difficult for framework authors.

(c) most server software (e.g., HDFS, Kafka, Cassandra, etc.) only
understands a simple hierarchy -- in many cases, just a single level
("racks"), or occasionally two levels ("racks" and "DCs").
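
To make the two-level scheme concrete, here is a minimal sketch of how a framework might use it. The `FaultDomain` and `Agent` structs and the `spreadAcrossRacks` function are hypothetical and only illustrate the idea; they are not the API proposed in the design doc.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the proposed Mesos API.
struct FaultDomain {
  std::string region;
  std::string rack;
};

struct Agent {
  std::string id;
  FaultDomain domain;
};

// Picks up to `count` agent IDs, preferring agents whose (region, rack)
// pair has not been used yet, and falling back to already-used racks only
// when there are not enough distinct racks.
std::vector<std::string> spreadAcrossRacks(const std::vector<Agent>& agents,
                                           std::size_t count) {
  std::set<std::pair<std::string, std::string>> usedRacks;
  std::vector<std::string> chosen;
  std::vector<const Agent*> leftovers;

  for (const Agent& agent : agents) {
    if (chosen.size() == count) {
      break;
    }
    const auto key = std::make_pair(agent.domain.region, agent.domain.rack);
    if (usedRacks.insert(key).second) {
      chosen.push_back(agent.id);  // First agent seen in this rack.
    } else {
      leftovers.push_back(&agent);
    }
  }

  for (const Agent* agent : leftovers) {
    if (chosen.size() == count) {
      break;
    }
    chosen.push_back(agent->id);
  }
  return chosen;
}
```

A framework that only understands "racks" can ignore the region field entirely, which is part of why a small, fixed hierarchy is easier to consume than an arbitrary list of levels.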

Can you elaborate on the use-cases that you see for a more complex
hierarchy of fault domains? I'd be happy to chat off-list if you'd
prefer.

Thanks!

Neil

On Tue, Apr 18, 2017 at 1:33 AM, Maxime Brugidou
<maxime.brugi...@gmail.com> wrote:
> Hi Neil,
>
> I really like the idea of incorporating the concept of fault domains in
> Mesos, however I feel like the implementation proposed is a bit narrow to be
> actually useful for most users.
>
> I feel like we could make the fault domains definition more generic. As an
> example in our setup we would like to have something like
> Region > Building > Cage > Pod > Rack. Failure domains would be
> hierarchically arranged
> (meaning one domain in a lower level can only be included in one domain
> above).
>
> As a concrete example, we could have the mesos masters be aware of the fault
> domain hierarchy (with a config map for example), and slaves would just need
> to declare their lowest-level domain (for example their rack id). Then
> frameworks could use this domain hierarchy at will. If they need to "spread"
> their tasks for a very highly available setup, they could first spread using
> the highest fault domain (like the region), then if they have enough tasks
> to launch they could spread within each sub-domain recursively until they
> run out of tasks to spread. We do not need to artificially limit the number
> of levels of fault domains and the name of the fault domains. Schedulers do
> not need to know the names either, just the hierarchy.
>
> Then, to provide the other feature of "remote" slaves that you describe, we
> could configure the mesos master to only send offers from a "default" local
> fault domain, and frameworks would need to advertise a certain capability to
> receive offers for other remote fault domains.
>
> I feel we could implement this by identifying a fault domain with a simple
> list of ids like ["US-WEST-1", "Building 2", "Cage 3", "POD 12", "Rack 3"]
> or ["US-EAST-2", "Building 1"]. Slaves would advertise their lowest-level
> fault domains and schedulers could use this arbitrarily as a hierarchical
> list.
>
> Thanks,
> Maxime
>
> On Mon, Apr 17, 2017 at 6:45 PM Neil Conway <neil.con...@gmail.com> wrote:
>>
>> Folks,
>>
>> I'd like to enhance Mesos to support a first-class notion of "fault
>> domains" -- i.e., identifying the "rack" and "region" (DC) where a
>> Mesos agent or master is located. The goal is to enable two main
>> features:
>>
>> (1) To make it easier to write "rack-aware" Mesos frameworks that are
>> portable to different Mesos clusters.
>>
>> (2) To improve the experience of configuring Mesos with a set of
>> masters and agents in one DC, and another pool of "remote" agents in a
>> different DC.
>>
>> For more information, please see the design doc:
>>
>>
>> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>>
>> I'd love any feedback, either directly on the Google doc or via email.
>>
>> Thanks,
>> Neil


[Design doc] RPC: Fault domains in Mesos

2017-04-17 Thread Neil Conway
Folks,

I'd like to enhance Mesos to support a first-class notion of "fault
domains" -- i.e., identifying the "rack" and "region" (DC) where a
Mesos agent or master is located. The goal is to enable two main
features:

(1) To make it easier to write "rack-aware" Mesos frameworks that are
portable to different Mesos clusters.

(2) To improve the experience of configuring Mesos with a set of
masters and agents in one DC, and another pool of "remote" agents in a
different DC.

For more information, please see the design doc:

https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8

I'd love any feedback, either directly on the Google doc or via email.

Thanks,
Neil


Requiring XCode >= 8.0 on OSX

2017-04-08 Thread Neil Conway
XCode < 8 does not support the C++11 `thread_local` construct. As a
result, we added a workaround to use `__thread` on OSX and
`thread_local` on other platforms:

https://reviews.apache.org/r/36845/

Since that workaround was added, XCode 8 has been released (in
September 2016) with support for `thread_local`.

If we required XCode >= 8.0 for building Mesos, we could remove this
workaround and use `thread_local` on all platforms. This would have
some advantages, such as:

(1) More uniform, standards-compliant code between OSX and other platforms.

(2) `thread_local` supports a few features that `__thread` does not,
such as support for non-POD data types.

Therefore, I'd like to propose we make XCode 8 a requirement for
building Mesos on OSX, and then get rid of the `THREAD_LOCAL`
workaround. Since XCode 8.0 requires OSX 10.11.5, this would also
imply dropping support for OSX 10.10.
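
For reference, here is a minimal sketch of point (2): a `thread_local` variable of a non-POD type. The names `threadMessages`, `record`, and `recordOnNewThread` are hypothetical and are not taken from the Mesos code base.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// std::vector<std::string> has a non-trivial constructor and destructor,
// so it is not a POD type. C++11 `thread_local` supports it; each thread
// gets its own independently constructed instance.
thread_local std::vector<std::string> threadMessages;

// Appends to the calling thread's buffer and returns its new size.
std::size_t record(const std::string& message) {
  threadMessages.push_back(message);
  return threadMessages.size();
}

// Runs `record` once on a fresh thread and returns the size that thread
// observed. A new thread always starts with an empty buffer.
std::size_t recordOnNewThread(const std::string& message) {
  std::size_t observed = 0;
  std::thread worker([&] { observed = record(message); });
  worker.join();
  return observed;
}
```

Calling `record` twice on one thread yields sizes 1 and 2, while `recordOnNewThread` reports 1 because the worker thread has its own buffer.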

Feedback welcome.

Neil


Re: Time Zone information in TimeInfo

2017-03-08 Thread Neil Conway
Sounds good: https://reviews.apache.org/r/57422/

Neil

On Mon, Mar 6, 2017 at 8:53 PM, Zameer Manji <zma...@apache.org> wrote:
> The TODO made me think that the time information here could be timezone
> dependent in some cases.
>
> If it's intended to always represent the time since the Unix epoch then TZ
> info is not useful.
>
> I think that comment should be removed for clarity.
>
> On Mon, Mar 6, 2017 at 8:38 PM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> I always found that TODO confusing. If a `TimeInfo` is intended to
>> represent the amount of time that has elapsed since the (Unix) epoch,
>> I would expect it to be timezone independent. Can you clarify why
>> having TZ info would be useful?
>>
>> Neil
>>
>> On Mon, Mar 6, 2017 at 7:51 PM, Zameer Manji <zma...@apache.org> wrote:
>> > Hey,
>> >
>> > I noticed there is a TODO on the TimeInfo for adding Time Zone information.
>> > ```
>> > /**
>> >  * Represents time since the epoch, in nanoseconds.
>> >  */
>> > message TimeInfo {
>> >   required int64 nanoseconds = 1;
>> >
>> >   // TODO(josephw): Add time zone information, if necessary.
>> > }
>> > ```
>> >
>> > Since there is no TZ information attached to the timestamp, should
>> > frameworks assume that the Mesos Master system TZ is the same as the
>> > framework TZ? That is what I'm thinking of doing, but I'm not sure what
>> > the intention of the authors of the API was.
>> >
>> > Also, would it be possible to attach TZ information? It would make
>> > understanding the TimeInfo much easier when it is received by the
>> > framework.
>> >
>> > --
>> > Zameer Manji
>>
>> --
>> Zameer Manji
>>


Re: Time Zone information in TimeInfo

2017-03-06 Thread Neil Conway
I always found that TODO confusing. If a `TimeInfo` is intended to
represent the amount of time that has elapsed since the (Unix) epoch,
I would expect it to be timezone independent. Can you clarify why
having TZ info would be useful?
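
To illustrate why a count of nanoseconds since the epoch needs no time zone, here is a minimal sketch. The `toUtcString` helper is hypothetical and not part of the Mesos API; it assumes a non-negative value and a `time_t` that counts seconds since the Unix epoch.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Formats nanoseconds since the Unix epoch as a UTC timestamp. The input
// identifies a single instant; a time zone only matters when rendering
// that instant for display, and here UTC is chosen explicitly.
std::string toUtcString(std::int64_t nanoseconds) {
  const std::time_t seconds =
      static_cast<std::time_t>(nanoseconds / 1000000000LL);
  // std::gmtime is not thread-safe; that is acceptable for this sketch.
  const std::tm* utc = std::gmtime(&seconds);
  if (utc == nullptr) {
    return "";
  }
  char buffer[32];
  std::strftime(buffer, sizeof(buffer), "%Y-%m-%dT%H:%M:%SZ", utc);
  return buffer;
}
```

The result does not depend on the TZ setting of the machine that produced the value or of the machine that formats it.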

Neil

On Mon, Mar 6, 2017 at 7:51 PM, Zameer Manji  wrote:
> Hey,
>
> I noticed there is a TODO on the TimeInfo for adding Time Zone information.
> ```
> /**
>  * Represents time since the epoch, in nanoseconds.
>  */
> message TimeInfo {
>   required int64 nanoseconds = 1;
>
>   // TODO(josephw): Add time zone information, if necessary.
> }
> ```
>
> Since there is no TZ information attached to the timestamp, should frameworks
> assume that the Mesos Master system TZ is the same as the framework TZ?
> That is what I'm thinking of doing, but I'm not sure what was the intention
> of the authors of the API.
>
> Also, would it be possible to attach TZ information? It would make
> understanding the TimeInfo much easier when it is received by the framework.
>
> --
> Zameer Manji


Re: Welcome Kevin Klues as a Mesos Committer and PMC member!

2017-03-01 Thread Neil Conway
Congratulations Kevin! Very well-deserved.

Neil

On Wed, Mar 1, 2017 at 2:05 PM, Benjamin Mahler  wrote:
> Hi all,
>
> Please welcome Kevin Klues as the newest committer and PMC member of the
> Apache Mesos project.
>
> Kevin has been an active contributor in the project for over a year, and in
> this time he made a number of contributions to the project: Nvidia GPU
> support [1], the containerization side of POD support (new container init
> process), and support for "attach" and "exec" of commands within running
> containers [2].
>
> Also, Kevin took on an effort with Haris Choudhary to revive the CLI [3]
> via a better structured python implementation (to be more accessible to
> contributors) and a more extensible architecture to better support adding
> new or custom subcommands. The work also adds a unit test framework for the
> CLI functionality (we had no tests previously!). I think it's great that
> Kevin took on this much needed improvement with Haris, and I'm very much
> looking forward to seeing this land in the project.
>
> Here is his committer eligibility document for perusal:
> https://docs.google.com/document/d/1mlO1yyLCoCSd85XeDKIxTYyboK_uiOJ4Uwr6ruKTlFM/edit
>
> Thanks!
> Ben
>
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2]
> https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU
> [3]
> https://docs.google.com/document/d/1r6Iv4Efu8v8IBrcUTjgYkvZ32WVscgYqrD07OyIglsA/


Re: [VOTE] Release Apache Mesos 1.2.0 (rc2)

2017-03-01 Thread Neil Conway
The perf core dump might be addressed if we backport this change:

https://reviews.apache.org/r/56611/

Although my guess is that this isn't a severe problem: for some
as-yet-unknown reason, running `perf` on the host segfaulted, which
causes the test to fail.

Neil

On Wed, Mar 1, 2017 at 11:09 AM, Vinod Kone  wrote:
> Tested on ASF CI.
>
> Saw 2 configurations fail. One was the perf core dump issue. The other is a
> known (since 0.28.0) flaky test with the Docker fetcher plugin.
>
> Withholding the vote until we know the severity of the perf core dump.
>
>
> *Revision*: b9d8202a7444d0d1e49476bfc9817eb4583beaff
>
>- refs/tags/1.1.1-rc2
>
> Configuration Matrix
>
> OS            Flags                                      Build      gcc      clang
> centos:7      --verbose --enable-libevent --enable-ssl   autotools  Success  Not run
> centos:7      --verbose --enable-libevent --enable-ssl   cmake      Success  Not run
> centos:7      --verbose                                  autotools  Success  Not run
> centos:7      --verbose                                  cmake      Success  Not run
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl   autotools  Success  Failed
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl   cmake      Success  Success
> ubuntu:14.04  --verbose                                  autotools  Success  Failed
> ubuntu:14.04  --verbose                                  cmake      Success  Success
>
> On Wed, Mar 1, 2017 at 9:24 AM, Greg Mann  wrote:
>
>> I wanted to give a heads up on a flaky test failure I've encountered while
>> testing this RC:
>> 'DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_DockerDefaultEntryptRegistryPuller'.
>> One issue related to this test was resolved recently
>> (https://issues.apache.org/jira/browse/MESOS-6001), but this seems to be a
>> separate issue (https://issues.apache.org/jira/browse/MESOS-7185). I
>> haven't had time to
>> triage yet so I'm not sure if this represents a 

Re: Proposal for Mesos Build Improvements

2017-02-15 Thread Neil Conway
On Wed, Feb 15, 2017 at 1:59 PM, Jeff Coffler
 wrote:
> 3. Maintaining the correct includes is nice, but not at the cost of compiler 
> speed.

Personally, I would invert these statements -- but until we know the
cost of the redundant includes, probably not worth debating further.

> 4. I totally disagree about auto-generating the PCH. We should go through the 
> sources and pick what makes sense. Auto-generating implies that we 
> auto-generate all the time (on every build), and I'd rather not scan the 
> sources during a build (with an associated speed hit) just to try and speed 
> up the build.

The problem is that "what makes sense" will change over time.
Auto-generating the PCH certainly doesn't mean it needs to be
generated as part of the build process: a script (or docker container)
to generate "mesos_common.hpp" on-demand would be fine with me, as
long as it is a mechanical process.

Neil


Re: Proposal for Mesos Build Improvements

2017-02-15 Thread Neil Conway
On Tue, Feb 14, 2017 at 11:28 AM, Jeff Coffler
 wrote:
> For efficiency purposes, if a header file is included by 50% or more of the 
> source files, it should be included in the precompiled header. If a header is 
> included in fewer than 50% of the source files, then it can be separately 
> included (and thus would not benefit from precompiled headers). Note that 
> this is a guideline; even if a header is used by less than 50% of source 
> files, if it's very large, we still may decide to throw it in the precompiled 
> header.

It seems like this would have the effect of creating many false
dependencies: if file X doesn't currently include header Y but Y is
included in the precompiled header, the symbols in Y will now be
visible when X is compiled. It would also mean that X would need to be
recompiled when Y changes.

Related: the current policy is that headers and implementation files
should try to include all of their dependencies, without relying on
transitive includes. For example, if foo.cpp includes bar.hpp, which
includes , but foo.cpp also uses , both foo.cpp and
bar.hpp should "#include ". Adopting precompiled headers would
mean making an exception to this policy, right?

I wonder if we should instead use headers like:

<- mesos_common.h ->
#include 
#include 
#include 

<- xyz.cpp, which needs headers "b" and "d" ->
#include "mesos_common.h"

#include <b>
#include <d>

That way, the fact that "xyz.cpp" logically depends on  (but not
 or ) is not obscured (in other words, Mesos should continue to
compile if 'mesos_common.h' is replaced with an empty file). Does
anyone know whether the header guard in  _should_ make the repeated
inclusion of  relatively cheap?
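
As a small illustration of the header-guard question, the sketch below pastes the contents of a hypothetical guarded header twice into one file, which is what the preprocessor effectively sees when a header is included both via a common header and directly. The guard macro `EXAMPLE_B_HPP` and the function `answer` are made up for this example. This only shows that the second copy is skipped; it says nothing about how much time the compiler spends locating and rescanning the file, which depends on the compiler.

```cpp
// First inclusion: the guard macro is not yet defined, so the body is
// compiled and the macro becomes defined.
#ifndef EXAMPLE_B_HPP
#define EXAMPLE_B_HPP

inline int answer() { return 42; }

#endif  // EXAMPLE_B_HPP

// Second inclusion of the same (hypothetical) header: the guard macro is
// already defined, so everything up to the matching #endif is skipped.
// Without the guard, the duplicate definition below would be an error.
#ifndef EXAMPLE_B_HPP
#define EXAMPLE_B_HPP

inline int answer() { return 42; }

#endif  // EXAMPLE_B_HPP
```

GCC's preprocessor documentation describes a "once-only headers" optimization that avoids rescanning a file wrapped in such a guard, but whether that makes the redundant includes cheap enough in practice would need to be measured.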

Neil


Re: Proposal for Mesos Build Improvements

2017-02-14 Thread Neil Conway
I'm curious to hear more about how using PCH compares with making
stout a non-header-only library. Is PCH easier to implement, or is it
expected to offer a more dramatic improvement in compile times? Would
making both changes eventually make sense?

Neil

On Tue, Feb 14, 2017 at 11:28 AM, Jeff Coffler
 wrote:
> Proposal For Build Improvements
>
> The Mesos build process is in dire need of some build infrastructure 
> improvements. These improvements will improve speed and ease of work in 
> particular components, and dramatically improve overall build time, 
> especially in the Windows environment, but likely in the Linux environment as 
> well.
>
>
> Background:
>
> It is currently recommended to use the ccache project with the Mesos build 
> process. This makes the Linux build process more tolerable in terms of speed, 
> but unfortunately such software is not available on Windows. Ultimately, 
> though, the caching software is covering up two fundamental flaws in the 
> overall build process:
>
> 1. Lack of use of libraries
> 2. Lack of precompiled headers
>
> By not allowing use of libraries, the overall build process is often much 
> longer, particularly when a lot of work is being done in a particular 
> component. If work is being done in a particular component, only that library 
> need be rebuilt (and then the overall image relinked). Currently, since there 
> is no such modularization, all source files must be considered at build time. 
> Interestingly enough, there is such modularization in the source code layout; 
> that modularization just isn't utilized at the compiler level.
>
> Precompiled headers exist on both Windows and Linux. For Linux, you can refer 
> to https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html. Straight from 
> the GNU CC documentation: "The time the compiler takes to process these 
> header files over and over again can account for nearly all of the time 
> required to build the project."
>
> In my prior use of precompiled headers, each C or C++ file generally took 
> about 4 seconds to compile. After switching to precompiled headers, the 
> precompiled header creation took about 4 seconds, but each C/C++ file now 
> took about 200 milliseconds to compile. The overall build time was thus
> dramatically reduced.
>
>
> Scope of Changes:
>
> These changes are only being proposed for the CMake system. Going forward, 
> the CMake system is the easiest way to maintain some level of portability 
> between the Linux and Windows platforms.
>
>
> Details for Modularization:
>
> For the modularization, the intent is to simply make each source directory of 
> files, if functionally separate, to be compiled into an archive (.a) file. 
> These archive files will then be linked together to form the actual 
> executables. These changes will primarily be in the CMake system, and should 
> have limited effect on any actual source code.
>
> At a later date, if it makes sense, we can look at building shared library 
> (.so) files. However, this only makes the most sense if the code is truly 
> shared between different executable files. If that's not the case, then it 
> likely makes sense just to stick with .a files. Regardless, generation of .so 
> files is out of scope for this change.
>
>
> Details for Precompiled Header Changes:
>
> Precompiled headers will make use of stout (a very large header-only library) 
> essentially "free" from a compile-time overhead point of view. Basically, 
> precompiled headers will take a list of header files (including very long 
> header files, like "windows.h"), and generate the compiler memory structures 
> for their representation.
>
> During precompiled header generation, these memory structures are flushed to 
> disk. Then, when components are built, the memory structures are reloaded 
> from disk, which is dramatically faster than actually parsing the tens of 
> thousands of lines of header files and building the memory structures.
>
> For precompiled headers to be useful, a relatively "consistent" set of 
> headers must be included by all of the C/C++ files. So, for example, consider 
> the following C file:
>
> #if defined(windows)
> #include 
> #endif
>
> #include 
> #include 
> #include 
>
> < - Remainder of module - >
>
> To make a precompiled header for this module, all of the #include files would 
> be included in a new file, mesos_common.h. The C file would then be changed 
> as follows:
>
> #include "mesos_common.h"
>
> < - Remainder of module - >
>
> Structurally, the code is identical, and need not be built with precompiled 
> headers. However, use of precompiled headers will make file compilation 
> dramatically faster.
>
> Note that other include files can be included after the precompiled header if 
> appropriate. For example, the following is valid:
>
> #include "mesos_common.h"
> #include 
>
> < - Remainder of module - >
>
> For efficiency purposes, if a header file is included by 50% or 

Re: Tracking deprecated features

2017-02-07 Thread Neil Conway
Strongly agree that this can and should be improved! Two questions/suggestions:

(1) Should we use JIRA, the website/docs, or both? If we only use
JIRA, it might not be obvious to users that, e.g., the "--roles"
master flag is deprecated. An alternative would be a table in the
docs, listing (a) when a feature was deprecated, (b) when a feature
will be removed, (c) links to JIRAs.

(2) In some ways, experimental features are the inverse of deprecated
features (e.g., typical evolution might be experimental -> stable ->
deprecated -> removed). We should make it more clear to users (a)
which features are currently experimental, and (b) when those
experimental features graduate to being "stable". I wonder if we could
use a similar system to what you propose for making this information
more clear to users.

Neil

On Tue, Feb 7, 2017 at 2:56 AM, Benjamin Bannier
 wrote:
> Hi,
>
> we currently track deprecation of features largely through TODOs in the 
> source code. Here we typically write down a release at which a deprecated 
> feature should be removed.
>
> I believe this is less than optimal since
>
> * it is hard for users of our APIs to track when a deprecated feature is 
> actually removed,
> * it seems to encourage versioning-related discussions to happen in 
> potentially low-visibility review requests instead of JIRA tickets,
> * this approach can lead to wrong or misleading information in the code as 
> our versioning policies evolve and mature, and
> * poor trackability of outstanding deprecations leads to lots of missed 
> opportunities to remove features already out of their deprecation cycle as we 
> prepare releases.
>
> I would like to propose to use JIRA for tracking deprecations instead.
>
> A possible approach would be:
>
> 1) When a feature becomes deprecated, a JIRA ticket is created for its 
> removal. The ticket can be referenced in the source code.
> 2) The ticket should be tagged with e.g. `deprecation`, and optimally link 
> back to the ticket triggering the deprecation.
> 3) A target version is set in collaboration with maintainers of the 
> versioning policy.
> 4) The release process is updated to involve bumping target versions of 
> unfixed deprecation tickets to the following version.
>
> I believe with this we would be able to better keep track of and ultimately
> fix tech debt, as well as better communicate breaking changes to users.
>
> Any thoughts?
>
>
> Cheers,
>
> Benjamin


Re: Build failed in Jenkins: Mesos-Buildbot » autotools,gcc,--verbose,GLOG_v=1 MESOS_VERBOSE=1,ubuntu:14.04,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2) #3220

2017-02-06 Thread Neil Conway
I haven't seen this test fail elsewhere, but there's at least one
other instance of it failing on ASF CI. Unfortunately I couldn't fetch
the logs in either case (any chance we could change the ASF Jenkins
configuration to keep logs for failing jobs for longer?). I'll keep an
eye out to see if I can get a log.

Neil

On Sat, Feb 4, 2017 at 1:24 PM, Benjamin Mahler  wrote:
> +neil
>
> Is this test flaky? I wasn't able to grab the logs since jenkins appears to
> be non-responsive at the moment.
>
> [  FAILED  ] RegistrarTest.PruneUnreachable
>
> On Sat, Feb 4, 2017 at 3:03 AM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> See > autotools,COMPILER=gcc,CONFIGURATION=--verbose,
>> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,
>> label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/3220/changes>
>>
>> Changes:
>>
>> [xujyan] Use the stout ELF parser to implement ldd.
>>
>> [xujyan] Add some simple ldd() tests.
>>
>> [xujyan] Use the stout ELF parser to collect Linux rootfs files.
>>
>> [benjamin.hindman] Replaced recursive implementation in http::Connection
>> with loop.
>>
>> [benjamin.hindman] Re-enabled disabled test.
>>
>> [benjamin.hindman] Replaced (another) recursive implementation with
>> process::loop.
>>
>> [benjamin.hindman] Re-enabled a test.
>>
>> [benjamin.hindman] Introduced process::after.
>>
>> [benjamin.hindman] Used process::after instead of process::RateLimiter.
>>
>> --
>> [...truncated 177912 lines...]
>> I0204 11:00:49.951364 30284 status_update_manager.cpp:203] Recovering
>> status update manager
>> I0204 11:00:49.951576 30279 containerizer.cpp:599] Recovering containerizer
>> I0204 11:00:49.952945 30290 provisioner.cpp:410] Provisioner recovery
>> complete
>> I0204 11:00:49.953316 30280 slave.cpp:5499] Finished recovery
>> I0204 11:00:49.953835 30280 slave.cpp:5673] Querying resource estimator
>> for oversubscribable resources
>> I0204 11:00:49.954129 30279 slave.cpp:5687] Received oversubscribable
>> resources {} from the resource estimator
>> I0204 11:00:49.957427 30284 process.cpp:3697] Handling HTTP event for
>> process 'slave(694)' with path: '/slave(694)/monitor/statistics.json'
>> I0204 11:00:49.958652 30289 http.cpp:871] Authorizing principal
>> 'test-principal' to GET the '/monitor/statistics.json' endpoint
>> I0204 11:00:49.962405 30280 slave.cpp:803] Agent terminating
>> [   OK ] Endpoint/SlaveEndpointTest.AuthorizedRequest/1 (31 ms)
>> [ RUN  ] Endpoint/SlaveEndpointTest.AuthorizedRequest/2
>> I0204 11:00:49.972606 30259 containerizer.cpp:220] Using isolation:
>> posix/cpu,posix/mem,filesystem/posix,network/cni
>> W0204 11:00:49.973045 30259 backend.cpp:76] Failed to create 'aufs'
>> backend: AufsBackend requires root privileges
>> W0204 11:00:49.973139 30259 backend.cpp:76] Failed to create 'bind'
>> backend: BindBackend requires root privileges
>> I0204 11:00:49.973171 30259 provisioner.cpp:249] Using default backend
>> 'copy'
>> I0204 11:00:49.976068 30278 slave.cpp:211] Mesos agent started on (695)@
>> 172.17.0.4:47679
>> I0204 11:00:49.976094 30278 slave.cpp:212] Flags at startup: --acls=""
>> --appc_simple_discovery_uri_prefix="http://; 
>> --appc_store_dir="/tmp/mesos/store/appc"
>> --authenticate_http_readonly="true" --authenticate_http_readwrite="true"
>> --authenticatee="crammd5" --authentication_backoff_factor="1secs"
>> --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false"
>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> --cgroups_limit_swap="false" --cgroups_root="mesos" 
>> --container_disk_watch_interval="15secs"
>> --containerizers="mesos" --credential="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_2_WNyVh6/credential" --default_role="*"
>> --disk_watch_interval="1mins" --docker="docker"
>> --docker_kill_orphans="true" --docker_registry="https://
>> registry-1.docker.io" --docker_remove_delay="6hrs"
>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
>> --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_
>> dir="/var/run/mesos/isolators/docker/volume" 
>> --enforce_container_disk_quota="false"
>> --executor_registration_timeout="1mins" 
>> --executor_shutdown_grace_period="5secs"
>> --fetcher_cache_dir="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_2_WNyVh6/fetch" --fetcher_cache_size="2GB"
>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> --hadoop_home="" --help="false" --hostname_lookup="true"
>> --http_authenticators="basic" --http_command_executor="false"
>> --http_credentials="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_2_WNyVh6/http_credentials" 
>> --http_heartbeat_interval="30secs"
>> --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem"
>> --launcher="posix" --launcher_dir="/mesos/mesos-1.2.0/_build/src"
>> --logbufsecs="0" --logging_level="INFO" 
>> 

Re: Disallowing pre-1.0 Mesos agents

2017-01-23 Thread Neil Conway
Right -- I'll add a note to that effect to the CHANGELOG for 1.3.0.

Neil

On Fri, Jan 20, 2017 at 7:11 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> +1
>
> Thanks for taking on the explicit rejection, much needed safety
> improvement. If we state that a Mesos 1.3.0 master will not support pre-1.0
> agents, it seems to carry an implication that we supported 1.2 and 1.1
> having pre-1.0 agents. While it might work, we should clarify for users per
> vinod's comment.
>
> On Fri, Jan 20, 2017 at 12:45 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> +1
>>
>> Technically 0.28.0 was only supposed to be compatible with 0.27.0 and 1.0.
>>
>>
>> On Fri, Jan 20, 2017 at 8:02 PM, Zameer Manji <zma...@apache.org> wrote:
>>
>> > +1
>> >
>> >
>> >
>> > On Fri, Jan 20, 2017 at 10:58 AM, Neil Conway <neil.con...@gmail.com>
>> > wrote:
>> >
>> > > I'd like to propose that Mesos 1.3.0 should not allow pre-1.0
>> > > Mesos agents to register.
>> > >
>> > > Motivation:
>> > >
>> > > (1) We can simplify the master code in a few places. For example, we
>> > > can assume that we always have a FrameworkInfo for any task running on
>> > > a registered agent. Needing to handle running tasks without a
>> > > FrameworkInfo makes the code unreadable and has been a source of bugs.
>> > >
>> > > (2) The master only needs to report "orphan tasks" and "unregistered
>> > > frameworks" if the cluster contains pre-1.0 agents. If we disallow
>> > > such agents, we can remove the code for computing these fields in the
>> > > HTTP endpoints and elsewhere. (We'll probably still need to keep the
>> > > actual fields in the JSON/protobuf output for backward compatibility,
>> > > but they will always be empty.) We can also remove "orphan tasks" from
>> > > the web UI.
>> > >
>> > > In addition to declaring that Mesos 1.3.0 masters will not support
>> > > pre-1.0 Mesos agents in the CHANGELOG, it seems safer to me to
>> > > disallow such agents from registering.
>> > >
>> > > Comments welcome.
>> > >
>> > > Thanks,
>> > > Neil
>> > >
>> > > --
>> > > Zameer Manji
>> > >
>> >
>>


Re: Welcome Neil Conway as Mesos Committer and PMC member!

2017-01-22 Thread Neil Conway
Thanks for the kind words, everyone! It's been a pleasure to be a part
of the Mesos community, and I'm looking forward to continuing to
contribute.

Neil

On Sun, Jan 22, 2017 at 2:16 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> Congrats and welcome!
>
> On Fri, Jan 20, 2017 at 11:03 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> Hi folks,
>>
>> Please welcome Neil Conway as the newest committer and PMC member of the
>> Apache Mesos project.
>>
>> Neil has been an active contributor to Mesos for more than a year now. As
>> part of his work, he has contributed some major features (Partition aware
>> frameworks, floating point operations for resources). Neil also took the
>> initiative to improve the documentation of our project and shepherded
>> several improvements over time. Doing that even without being a committer,
>> shows that he takes ownership of the project seriously.
>>
>> Here is his more formal checklist for your perusal.
>>
>> https://docs.google.com/document/d/137MYwxEw9QCZRH09CXfn1544p1LuMuoj9LxS-sk2_F4/edit
>>
>> Thanks,
>> Vinod
>>


Disallowing pre-1.0 Mesos agents

2017-01-20 Thread Neil Conway
I'd like to propose that Mesos 1.3.0 should not allow pre-1.0
Mesos agents to register.

Motivation:

(1) We can simplify the master code in a few places. For example, we
can assume that we always have a FrameworkInfo for any task running on
a registered agent. Needing to handle running tasks without a
FrameworkInfo makes the code unreadable and has been a source of bugs.

(2) The master only needs to report "orphan tasks" and "unregistered
frameworks" if the cluster contains pre-1.0 agents. If we disallow
such agents, we can remove the code for computing these fields in the
HTTP endpoints and elsewhere. (We'll probably still need to keep the
actual fields in the JSON/protobuf output for backward compatibility,
but they will always be empty.) We can also remove "orphan tasks" from
the web UI.

In addition to declaring that Mesos 1.3.0 masters will not support
pre-1.0 Mesos agents in the CHANGELOG, it seems safer to me to
disallow such agents from registering.
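
A minimal sketch of what such a check could look like is below. The function name `allowsRegistration` and the assumption that the agent reports its version as an "X.Y.Z" string are hypothetical; this is not the actual master code.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical check, not the actual Mesos master code. Returns true only
// if the reported agent version parses and its major version is at least 1.
// An empty or unparsable version is treated as a pre-1.0 agent.
bool allowsRegistration(const std::string& agentVersion) {
  std::size_t consumed = 0;
  int majorVersion = 0;
  try {
    // std::stoi stops at the first non-digit, so "0.28.0" yields 0.
    majorVersion = std::stoi(agentVersion, &consumed);
  } catch (const std::exception&) {
    return false;
  }
  return consumed > 0 && majorVersion >= 1;
}
```

Rejecting at registration time means the simplifications described in (1) and (2) can rely on the invariant, instead of only documenting it in the CHANGELOG.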

Comments welcome.

Thanks,
Neil


Re: Map support in proto2

2016-12-18 Thread Neil Conway
I believe `oneof` is supported in protobuf 2.6.1 [1], so we wouldn't
need to upgrade to make use of it. But I agree that upgrading to
protobuf 3 (while continuing to use the proto2 language version) is
worth doing at some point.

Neil

[1] https://groups.google.com/forum/#!topic/protobuf/lvI8-sWZbUY/discussion

On Sun, Dec 18, 2016 at 4:31 PM, Michael Park  wrote:
> According to this thread
> https://groups.google.com/forum/m/#!topic/protobuf/p4WxcplrlA4,
>
> It would involve us upgrading proto to 3.0.0 (which does not mean proto3).
> It seems like we would also get `oneof` which would also be very useful for
> us.
>
> On Sun, Dec 18, 2016 at 11:00 AM Jie Yu  wrote:
>
>> Looks like Map is already supported in proto2:
>>
>> https://developers.google.com/protocol-buffers/docs/reference/cpp-generated#map-fields
>>
>> It'll greatly simplify our JSON parsing in a few cases (e.g., converting from
>> OCI image configuration to a protobuf definition for a key->value JSON
>> object):
>>
>> "annotations" : {
>>   "key1" : "value1",
>>   "key2" : "value2"
>> }
>>
>> Previously, it was not possible to have a protobuf schema for such JSON.
>>
>> Thoughts?
>>
>> - Jie
>>


Re: Building on OS X 10.12

2016-12-12 Thread Neil Conway
I think we should look into adopting "-fvisibility=hidden" and
explicitly annotating the symbols that we want to export:

https://issues.apache.org/jira/browse/MESOS-6734

Although I agree this isn't a trivial change and it would be good to
have some tool support here, there are lots of benefits [1,2].
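
To make "explicitly annotating" concrete, here is a minimal sketch of what an annotated symbol could look like. The macro name `MESOS_EXPORT` and the two functions are hypothetical and only for illustration; nothing here is taken from the Mesos tree.

```cpp
#include <cassert>

// Hypothetical export macro; the name MESOS_EXPORT is an assumption made for
// this example. On GCC and Clang it marks a symbol as visible even when the
// translation unit is compiled with -fvisibility=hidden.
#if defined(__GNUC__) || defined(__clang__)
#define MESOS_EXPORT __attribute__((visibility("default")))
#else
#define MESOS_EXPORT
#endif

// Intended as public API: stays exported under -fvisibility=hidden.
MESOS_EXPORT int publicAnswer();

// Internal helper: not annotated, so it is hidden from the dynamic symbol
// table when the library is built with -fvisibility=hidden.
int internalHelper();

int internalHelper() { return 41; }

int publicAnswer() { return internalHelper() + 1; }
```

The cost is that every symbol we want to export needs the annotation, which is where tool support would help.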

Neil

[1] https://gcc.gnu.org/wiki/Visibility
[2] https://software.intel.com/sites/default/files/m/a/1/e/dsohowto.pdf

On Mon, Dec 12, 2016 at 2:17 PM, Joris Van Remoortere
 wrote:
> There are a significant number of developer and runtime performance
> benefits from turning this flag on.
> In my opinion it is also a dangerous flag to turn on by default without a
> strict set of rules for our codebase to ensure that we don't accidentally:
>
>- have multiple instances of static variables when we think they are a
>singleton
>- run into inequality when we expect equality for comparison of in-lined
>function pointers (For example when building a vtable in a library for
>something like variant / visitor)
>
> Although the likelihood that our codebase would suffer from these is low, I
> would prefer to have clang-tidy support and have some rule checkers to
> ensure that we can turn this flag on by default and know we will catch any
> future code that may break these rules.
>
> @James have you done any validation of the codebase and the libraries we
> depend on to ensure this is safe?
>
> Joris
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Mon, Dec 5, 2016 at 1:16 PM, James Peach  wrote:
>
>>
>> > On Dec 2, 2016, at 10:54 PM, Jie Yu  wrote:
>> >
>> > Another tip. If you are on macOS sierra, you might notice the linking is
>> > extremely slow using the default clang.
>> >
>> > Using CXXFLAGS `-fvisibility-inlines-hidden` will greatly speedup the
>> > linking.
>>
>> Is there a reason we should not always do this? It reduces the number of
>> exported symbols in libmesos.so from 250K to 100K.
>>
>> J


Re: Duplicate task IDs

2016-12-12 Thread Neil Conway
On Mon, Dec 12, 2016 at 1:32 PM, Joris Van Remoortere
 wrote:
> It sounds like using a multi_hashmap for now allows you to clean up the
> code and avoid some bugs, without changing the existing behavior.

Because we want cache-like behavior (bounded size + LRU replacement),
this would require adding a new data structure, BoundedMultiHashMap
(https://reviews.apache.org/r/54178/). That seems like overkill to me,
for now.

> It would also be unfortunate if we said we were dis-allowing duplicate task
> ids but only catch some of the manifestations.

Definitely unfortunate, but I don't see an alternative, as long as we
continue to allow frameworks to freely choose their own task IDs.

Neil


Re: Duplicate task IDs

2016-12-12 Thread Neil Conway
Hi Joris,

Fair point: I didn't deliberately set out to change the behavior for
duplicate task IDs. Rather, it was a consequence of switching from
boost::circular_buffer to using a hashmap for managing completed
tasks. Using a hashmap has a few minor advantages [1], but we can
certainly continue using circular_buffer (or a multi-hashmap) if we
want to keep the current behavior.

I think we have the following options:

(1) Keep the current behavior: reusing task IDs is discouraged but supported.

(2) Per Alex's suggestion, we can say that frameworks are no longer
allowed to reuse task IDs. Because the master only keeps a
limited-size cache of completed tasks (which is not preserved across
master restart or failover), we wouldn't be able to reject all
situations in which frameworks attempt to reuse task IDs.

If we pursue #2, we might need a deprecation period or master
capability to give framework authors some time to migrate.

For the moment, I'll avoid changing the behavior for duplicate task
IDs; I've opened https://issues.apache.org/jira/browse/MESOS-6779 to
track this issue. If you have an opinion on this change, please
weigh in, either on this thread or on JIRA.

Neil

[1] Specifically, making the management of completed and unreachable
tasks more symmetric and avoiding some bugs/undefined behavior in
boost::circular_buffer. O(1) lookup of completed tasks might be useful
in the future but isn't used right now.

On Fri, Dec 9, 2016 at 2:13 PM, Joris Van Remoortere
<jo...@mesosphere.io> wrote:
> Hey Neil,
>
> I concur that using duplicate task IDs is bad practice and asking for
> trouble.
>
> Could you please clarify *why* you want to use a hashmap? Is your goal to
> remove duplicate task IDs or is this just a side-effect and you have a
> different reason (e.g. performance) for using a hashmap?
>
> I'm wondering why a multi-hashmap is not sufficient. This would be clear if
> you were explicitly *trying* to get rid of duplicates of course :-)
>
> Thanks,
> Joris
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Dec 9, 2016 at 7:08 AM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> Folks,
>>
>> The master stores a cache of metadata about recently completed tasks;
>> for example, this information can be accessed via the "/tasks" HTTP
>> endpoint or the "GET_TASKS" call in the new Operator API.
>>
>> The master currently stores this metadata using a list; this means
>> that duplicate task IDs are permitted. We're considering [1] changing
>> this to use a hashmap instead. Using a hashmap would mean that
>> duplicate task IDs would be discarded: if two completed tasks have the
>> same task ID, only the metadata for the most recently completed task
>> would be retained by the master.
>>
>> If this behavior change would cause problems for your framework or
>> other software that relies on Mesos, please let me know.
>>
>> (Note that if you do have two completed tasks with the same ID, you'd
>> need an unambiguous way to tell them apart. As a recommendation, I
>> would strongly encourage framework authors to never reuse task IDs.)
>>
>> Neil
>>
>> [1] https://reviews.apache.org/r/54179/
>>


Duplicate task IDs

2016-12-09 Thread Neil Conway
Folks,

The master stores a cache of metadata about recently completed tasks;
for example, this information can be accessed via the "/tasks" HTTP
endpoint or the "GET_TASKS" call in the new Operator API.

The master currently stores this metadata using a list; this means
that duplicate task IDs are permitted. We're considering [1] changing
this to use a hashmap instead. Using a hashmap would mean that
duplicate task IDs would be discarded: if two completed tasks have the
same task ID, only the metadata for the most recently completed task
would be retained by the master.
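
To illustrate the proposed semantics, here is a minimal sketch of a bounded cache keyed by task ID. `CompletedTaskCache` is a hypothetical class written for this example, with task metadata reduced to a string; it is not the data structure from the review.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical bounded cache of completed tasks, keyed by task ID.
class CompletedTaskCache
{
public:
  explicit CompletedTaskCache(std::size_t capacity) : capacity_(capacity)
  {
    assert(capacity > 0);
  }

  // Records a completed task. A later task with the same ID replaces the
  // earlier one; when the cache is full, the oldest entry is evicted.
  void put(const std::string& taskId, const std::string& metadata)
  {
    auto it = index_.find(taskId);
    if (it != index_.end()) {
      entries_.erase(it->second);
      index_.erase(it);
    }

    entries_.emplace_back(taskId, metadata);
    index_[taskId] = std::prev(entries_.end());

    if (entries_.size() > capacity_) {
      index_.erase(entries_.front().first);
      entries_.pop_front();
    }
  }

  // Returns the metadata for `taskId`, or nullptr if it is not cached.
  const std::string* get(const std::string& taskId) const
  {
    auto it = index_.find(taskId);
    return it == index_.end() ? nullptr : &it->second->second;
  }

  std::size_t size() const { return entries_.size(); }

private:
  using Entries = std::list<std::pair<std::string, std::string>>;

  std::size_t capacity_;
  Entries entries_;  // Oldest completed task at the front.
  std::unordered_map<std::string, Entries::iterator> index_;
};
```

With a plain list, two `put` calls for the same task ID would leave two entries; here the second call overwrites the first.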

If this behavior change would cause problems for your framework or
other software that relies on Mesos, please let me know.

(Note that if you do have two completed tasks with the same ID, you'd
need an unambiguous way to tell them apart. As a recommendation, I
would strongly encourage framework authors to never reuse task IDs.)

Neil

[1] https://reviews.apache.org/r/54179/


Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose --enable-libevent --enable-ssl,GLOG_v=1 MESOS_VERBOSE=1,ubuntu:14.04,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6)&&(!ubuntu-eu2) #2933

2016-11-16 Thread Neil Conway
Has there been any response from the ASF Infra folks on addressing the
VM/hardware issues? Seems like it will be difficult to get good signal
from the ASF CI in the absence of some improvements on the
infrastructure side.

Neil

On Wed, Nov 16, 2016 at 10:45 AM, Alex R  wrote:
> Looks like VM lag again: http://pastebin.com/GZhG4fuN
>
> What do folks think about removing future timeouts in tests altogether?
> Instead, we can time the whole suite differently on different CIs?
>
> On 16 November 2016 at 15:30, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> See > COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%
>> 20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=
>> ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-
>> us1)&&(!ubuntu-6)&&(!ubuntu-eu2)/2933/changes>
>>
>> Changes:
>>
>> [alexr] Added a comment about deprecation cycle of quota get authz.
>>
>> --
>> [...truncated 222325 lines...]
>> I1116 14:27:35.400284 30948 containerizer.cpp:202] Using isolation:
>> posix/cpu,posix/mem,filesystem/posix,network/cni
>> W1116 14:27:35.400846 30948 backend.cpp:76] Failed to create 'aufs'
>> backend: AufsBackend requires root privileges, but is running as user mesos
>> W1116 14:27:35.400980 30948 backend.cpp:76] Failed to create 'bind'
>> backend: BindBackend requires root privileges
>> I1116 14:27:35.405436 30978 slave.cpp:208] Mesos agent started on (644)@
>> 172.17.0.3:56829
>> I1116 14:27:35.405462 30978 slave.cpp:209] Flags at startup: --acls=""
>> --appc_simple_discovery_uri_prefix="http://; 
>> --appc_store_dir="/tmp/mesos/store/appc"
>> --authenticate_http_readonly="true" --authenticate_http_readwrite="true"
>> --authenticatee="crammd5" --authentication_backoff_factor="1secs"
>> --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false"
>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> --cgroups_limit_swap="false" --cgroups_root="mesos" 
>> --container_disk_watch_interval="15secs"
>> --containerizers="mesos" --credential="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_1_6t56bO/credential" --default_role="*"
>> --disk_watch_interval="1mins" --docker="docker"
>> --docker_kill_orphans="true" --docker_registry="https://
>> registry-1.docker.io" --docker_remove_delay="6hrs"
>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
>> --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_
>> dir="/var/run/mesos/isolators/docker/volume" 
>> --enforce_container_disk_quota="false"
>> --executor_registration_timeout="1mins" 
>> --executor_shutdown_grace_period="5secs"
>> --fetcher_cache_dir="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_1_6t56bO/fetch" --fetcher_cache_size="2GB"
>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> --hadoop_home="" --help="false" --hostname_lookup="true"
>> --http_authenticators="basic" --http_command_executor="false"
>> --http_credentials="/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_1_6t56bO/http_credentials" 
>> --image_provisioner_backend="copy"
>> --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem"
>> --launcher="posix" --launcher_dir="/mesos/mesos-1.2.0/_build/src"
>> --logbufsecs="0" --logging_level="INFO" 
>> --max_completed_executors_per_framework="150"
>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>> --perf_interval="1mins" --qos_correction_interval_min="0ns"
>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>> --registration_backoff_factor="10ms" --resources="cpus:2;gpus:0;
>> mem:1024;disk:1024;ports:[31000-32000]" --revocable_cpu_low_priority="true"
>> --runtime_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_6t56bO"
>> --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
>> --switch_user="true" --systemd_enable_support="true"
>> --systemd_runtime_directory="/run/systemd/system" --version="false"
>> --work_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_OjLwwx"
>> I1116 14:27:35.406361 30978 credentials.hpp:86] Loading credential for
>> authentication from '/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_1_6t56bO/credential'
>> I1116 14:27:35.406677 30978 slave.cpp:346] Agent using credential for:
>> test-principal
>> I1116 14:27:35.406782 30978 credentials.hpp:37] Loading credentials for
>> authentication from '/tmp/Endpoint_SlaveEndpointTest_
>> AuthorizedRequest_1_6t56bO/http_credentials'
>> I1116 14:27:35.407196 30978 http.cpp:895] Using default 'basic' HTTP
>> authenticator for realm 'mesos-agent-readonly'
>> I1116 14:27:35.407546 30978 http.cpp:895] Using default 'basic' HTTP
>> authenticator for realm 'mesos-agent-readwrite'
>> I1116 14:27:35.409431 30978 slave.cpp:533] Agent resources: cpus(*):2;
>> mem(*):1024; disk(*):1024; ports(*):[31000-32000]
>> I1116 14:27:35.409689 30978 slave.cpp:541] Agent attributes: [  

Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose --enable-libevent --enable-ssl,GLOG_v=1 MESOS_VERBOSE=1,centos:7,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6) #2852

2016-10-31 Thread Neil Conway
I spent a little while looking into this. The
"PersistentVolumeEndpointsTest.OfferCreateThenEndpointRemove" test
fails on the following expectations:

https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1562
https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1564
https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1573

Which all seem quite innocent: similar or identical preamble code
occurs in many test cases. Looking at the log, it seems the scheduler
begins the authentication process but authentication times out:

12:27:56.527899 31618 sched.cpp:226] Version: 1.2.0
12:27:56.528548 31638 sched.cpp:330] New master detected at
master@172.17.0.2:48653
12:27:56.528661 31638 sched.cpp:396] Authenticating with master
master@172.17.0.2:48653
12:27:56.528681 31638 sched.cpp:403] Using default CRAM-MD5 authenticatee
12:28:01.529717 31637 sched.cpp:526] Authentication timed out
12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
master master@172.17.0.2:48653: Authentication discarded

In the scheduler driver, we fail the "authenticating" future at
12:28:01, but it is ~9 seconds before the associated `onAny` callback
is invoked to schedule retrying authentication; by the time the retry
backoff timeout expires, we've exceeded the 15 second Future timeout
in the test case.
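
The arithmetic behind that claim can be sketched as follows, using the timestamps from the log excerpt above. One assumption is labeled in the code: that the test's 15 second wait starts at roughly the moment authentication begins.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// Converts a wall-clock time of day from the log into microseconds since
// midnight.
inline Micros timeOfDay(int h, int m, int s, int us)
{
  return std::chrono::hours(h) + std::chrono::minutes(m) +
         std::chrono::seconds(s) + Micros(us);
}

// Timestamps taken from the scheduler log lines quoted above.
const Micros authStarted = timeOfDay(12, 27, 56, 528661);
const Micros authTimedOut = timeOfDay(12, 28, 1, 529717);
const Micros callbackRan = timeOfDay(12, 28, 10, 795253);

// Assumption: the test's 15 second Future timeout starts roughly when
// authentication begins.
const Micros testTimeout = std::chrono::seconds(15);

// Time left for the retry backoff plus a second authentication attempt.
inline Micros remainingBudget()
{
  return testTimeout - (callbackRan - authStarted);
}
```

Under that assumption, about 5.0 seconds go to the authentication timeout and about 9.3 seconds to the delayed callback, which leaves roughly 0.73 seconds for the backoff and the retry.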

Note that between 12:28:01.5 and 12:28:10.8, there is essentially
nothing happening:

W1031 12:28:01.529717 31637 sched.cpp:526] Authentication timed out
W1031 12:28:01.529752 31645 master.cpp:6789] Authentication timed out
I1031 12:28:10.794798 31652 status_update_manager.cpp:203] Recovering
status update manager
W1031 12:28:10.795033 31645 master.cpp:6769] Failed to authenticate
scheduler-877be3e9-9dc1-4de1-bf3e-a19b77b3d124@172.17.0.2:48653:
Authentication discarded
I1031 12:28:10.794939 31647 authenticator.cpp:432] Authentication
session cleanup for crammd5-authenticatee(655)@172.17.0.2:48653
I1031 12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
master master@172.17.0.2:48653: Authentication discarded

So I think the most likely culprit is VM lag.

We can try to work around this by increasing some of the timeouts for
the test expectation futures, but of course that is just a kludge: if
we're going to experience random ~9 second VM-wide pauses, the tests
are likely to continue to be flaky unless we make more widespread
changes (e.g., increasing the timeouts of ALL expectation futures).

Neil


On Mon, Oct 31, 2016 at 8:34 AM, Apache Jenkins Server
 wrote:
> See 
> 
>
> Changes:
>
> [alexr] Updated the stale comment in agent flags.
>
> --
> [...truncated 219320 lines...]
> W1031 12:32:10.921492 31618 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges, but is running as user mesos
> W1031 12:32:10.921664 31618 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1031 12:32:10.925060 31647 slave.cpp:208] Mesos agent started on 
> (635)@172.17.0.2:48653
> I1031 12:32:10.925091 31647 slave.cpp:209] Flags at startup: --acls="" 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" 
> --credential="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_j6HfxC/credential"
>  --default_role="*" --disk_watch_interval="1mins" --docker="docker" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_j6HfxC/fetch"
>  --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
> --gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" 
> 

Re: mesos git commit: Added MESOS-6497 to CHANGELOG.

2016-10-28 Thread Neil Conway
This commit should also appear in the master branch, not just 1.1.x

Neil

On Fri, Oct 28, 2016 at 4:06 PM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/1.1.x bc7ecb8cf -> 7fce1b33f
>
>
> Added MESOS-6497 to CHANGELOG.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/7fce1b33
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/7fce1b33
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/7fce1b33
>
> Branch: refs/heads/1.1.x
> Commit: 7fce1b33fd7b0ef3f8dcfaa2d6557da1e3c6f957
> Parents: bc7ecb8
> Author: Till Toenshoff 
> Authored: Fri Oct 28 20:23:48 2016 +0200
> Committer: Till Toenshoff 
> Committed: Fri Oct 28 22:03:39 2016 +0200
>
> --
>  CHANGELOG | 1 +
>  1 file changed, 1 insertion(+)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/7fce1b33/CHANGELOG
> --
> diff --git a/CHANGELOG b/CHANGELOG
> index 3f03be0..d0a679d 100644
> --- a/CHANGELOG
> +++ b/CHANGELOG
> @@ -213,6 +213,7 @@ All Issues:
>* [MESOS-6446] - WebUI redirect doesn't work with stats from 
> /metric/snapshot.
>* [MESOS-6482] - Master check failure when marking an agent unreachable.
>* [MESOS-6483] - Check failure when a 1.1 master marking a 0.28 agent as 
> unreachable.
> +  * [MESOS-6497] - Java Scheduler Adapter does not surface MasterInfo.
>
>  ** Documentation
>* [MESOS-5221] - Add Documentation for Nvidia GPU support.
>


Re: Non-checkpointing frameworks

2016-10-18 Thread Neil Conway
Hi folks,

Thanks for the feedback!

On Mon, Oct 17, 2016 at 12:44 PM, Zhitao Li  wrote:
> +1 to both A and B.
>
> Do we plan to eventually drop non-checkpointed framework support (possibly
> in v2) and declare that all frameworks have to operate under this assumption?

I think that's worth considering for v2, *if* we don't find anyone
that has good reasons for disabling checkpointing. But I'd expect it
is more likely we keep it around as an option that is disabled by
default.

Neil


Non-checkpointing frameworks

2016-10-14 Thread Neil Conway
Hi folks,

I'd like input from individuals who currently use frameworks but do
not enable checkpointing.

Background: "checkpointing" is a parameter that can be enabled in
FrameworkInfo; if enabled, the agent will write the framework pid,
executor PIDs, and status updates to disk for any tasks started by
that framework. This checkpointed information means that these tasks
can survive an agent crash: if the agent exits (whether due to
crashing or as part of an upgrade procedure), a restarted agent can
use this information to reconnect to executors started by the previous
instance of the agent. The downside is that checkpointing requires
some additional disk I/O at the agent.

Checkpointing is not currently the default, but in my experience it is
often enabled for production frameworks. As part of the work on
supporting partition-aware Mesos frameworks (see MESOS-4049), we are
considering:

(a) requiring that partition-aware frameworks must also enable
checkpointing, and/or
(b) enabling checkpointing by default

If you have intentionally decided to disable checkpointing for your
Mesos framework, I'd be curious to hear more about your use-case and
why you haven't enabled it.

Thanks!

Neil


Re: mesos git commit: Added `DEFAULT_ROLE` constant to persistent volume tests.

2016-09-22 Thread Neil Conway
I'm not sure this is a good idea: the "default role" is actually "*".
That is also the default value for the "role" fields in the protobufs.
Perhaps we should name this new constant something like
DEFAULT_TEST_ROLE?

I wonder also if we should keep the definition local to
"persistent_volume_tests.cpp", unless there are imminent plans to use
it elsewhere.

Neil

On Thu, Sep 22, 2016 at 6:37 PM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 4df496aaf -> c2b595e1c
>
>
> Added `DEFAULT_ROLE` constant to persistent volume tests.
>
> Review: https://reviews.apache.org/r/41613/
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/c2b595e1
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/c2b595e1
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/c2b595e1
>
> Branch: refs/heads/master
> Commit: c2b595e1c59bb23e3f87545a7e22a76ad232ae9f
> Parents: 4df496a
> Author: Greg Mann 
> Authored: Thu Sep 22 15:47:31 2016 +0200
> Committer: Michael Park 
> Committed: Thu Sep 22 17:28:32 2016 +0200
>
> --
>  src/tests/mesos.hpp   |  1 +
>  src/tests/persistent_volume_tests.cpp | 51 +-
>  2 files changed, 30 insertions(+), 22 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/c2b595e1/src/tests/mesos.hpp
> --
> diff --git a/src/tests/mesos.hpp b/src/tests/mesos.hpp
> index 7095101..3cd63bd 100644
> --- a/src/tests/mesos.hpp
> +++ b/src/tests/mesos.hpp
> @@ -93,6 +93,7 @@ namespace tests {
>
>  constexpr char READONLY_HTTP_AUTHENTICATION_REALM[] = "test-readonly-realm";
>  constexpr char READWRITE_HTTP_AUTHENTICATION_REALM[] = 
> "test-readwrite-realm";
> +constexpr char DEFAULT_ROLE[] = "default-role";
>
>  // Forward declarations.
>  class MockExecutor;
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/c2b595e1/src/tests/persistent_volume_tests.cpp
> --
> diff --git a/src/tests/persistent_volume_tests.cpp 
> b/src/tests/persistent_volume_tests.cpp
> index c38d848..a726d78 100644
> --- a/src/tests/persistent_volume_tests.cpp
> +++ b/src/tests/persistent_volume_tests.cpp
> @@ -166,7 +166,7 @@ protected:
>case NONE: {
>  diskResource = createDiskResource(
>  stringify(mb.megabytes()),
> -"role1",
> +DEFAULT_ROLE,
>  None(),
>  None());
>
> @@ -175,7 +175,7 @@ protected:
>case PATH: {
>  diskResource = createDiskResource(
>  stringify(mb.megabytes()),
> -"role1",
> +DEFAULT_ROLE,
>  None(),
>  None(),
>  createDiskSourcePath(path::join(diskPath, "disk" + 
> stringify(id;
> @@ -185,7 +185,7 @@ protected:
>case MOUNT: {
>  diskResource = createDiskResource(
>  stringify(mb.megabytes()),
> -"role1",
> +DEFAULT_ROLE,
>  None(),
>  None(),
>  createDiskSourceMount(
> @@ -254,7 +254,7 @@ TEST_P(PersistentVolumeTest, 
> CreateAndDestroyPersistentVolumes)
>Clock::pause();
>
>FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> -  frameworkInfo.set_role("role1");
> +  frameworkInfo.set_role(DEFAULT_ROLE);
>
>// Create a master.
>master::Flags masterFlags = CreateMasterFlags();
> @@ -429,7 +429,7 @@ TEST_P(PersistentVolumeTest, ResourcesCheckpointing)
>ASSERT_SOME(slave);
>
>FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> -  frameworkInfo.set_role("role1");
> +  frameworkInfo.set_role(DEFAULT_ROLE);
>
>MockScheduler sched;
>MesosSchedulerDriver driver(
> @@ -495,7 +495,7 @@ TEST_P(PersistentVolumeTest, PreparePersistentVolume)
>ASSERT_SOME(slave);
>
>FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> -  frameworkInfo.set_role("role1");
> +  frameworkInfo.set_role(DEFAULT_ROLE);
>
>MockScheduler sched;
>MesosSchedulerDriver driver(
> @@ -564,7 +564,7 @@ TEST_P(PersistentVolumeTest, MasterFailover)
>ASSERT_SOME(slave);
>
>FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> -  frameworkInfo.set_role("role1");
> +  frameworkInfo.set_role(DEFAULT_ROLE);
>
>MockScheduler sched;
>TestingMesosSchedulerDriver driver(, , frameworkInfo);
> @@ -659,7 +659,7 @@ TEST_P(PersistentVolumeTest, 
> IncompatibleCheckpointedResources)
>spawn(slave1);
>
>FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> -  frameworkInfo.set_role("role1");
> +  frameworkInfo.set_role(DEFAULT_ROLE);
>
>MockScheduler sched;
>MesosSchedulerDriver driver(
> @@ -746,7 +746,7 @@ TEST_P(PersistentVolumeTest, 

Fwd: mesos git commit: Fixed a bug in getRootContainerId due to protobuf copying issue.

2016-09-19 Thread Neil Conway
Hi Jie,

Do you have more details on what exactly the problem is here? If
protobuf is unable to copy/merge nested messages in general, that
seems like something that might crop up elsewhere.

Perhaps we can (a) file a JIRA (ideally with a self-contained
test-case), and/or (b) report the problem to upstream?
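
For reference, here is a self-contained sketch of the kind of aliasing hazard the commit message below seems to describe. It uses a hand-written `Id` struct as a stand-in for a nested message, not protobuf itself; `Id`, `copyFrom` and `rootOf` are hypothetical, and whether protobuf's generated assignment really behaves this way is exactly what a proper test case would need to confirm.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a nested message such as ContainerID: a value
// plus an optional parent of the same type.
struct Id
{
  std::string value;
  std::unique_ptr<Id> parent;

  Id() = default;
  Id(const Id& that) { mergeFrom(that); }
  Id& operator=(const Id& that) { copyFrom(that); return *this; }

  void clear() { value.clear(); parent.reset(); }

  void mergeFrom(const Id& that)
  {
    value = that.value;
    if (that.parent) {
      parent.reset(new Id(*that.parent));
    }
  }

  // Clear-then-merge assignment. This guards against exact self-assignment
  // only: if `that` is owned by `*this` (for example `*this->parent`),
  // clear() destroys it before mergeFrom() reads it.
  void copyFrom(const Id& that)
  {
    if (&that == this) {
      return;
    }
    clear();
    mergeFrom(that);
  }
};

// Walks up to the root, copying the parent into an independent temporary
// first so the assignment never reads from an object owned by its target.
inline Id rootOf(const Id& id)
{
  Id root = id;
  while (root.parent) {
    Id parent = *root.parent;  // Independent copy, not owned by `root`.
    root = parent;
  }
  return root;
}
```

In this sketch, writing `root = *root.parent` directly would read from an object that the assignment has already destroyed, which matches the shape of the workaround in the patch.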

Neil

-- Forwarded message --
From:  
Date: Sat, Sep 17, 2016 at 11:27 PM
Subject: mesos git commit: Fixed a bug in getRootContainerId due to
protobuf copying issue.
To: comm...@mesos.apache.org


Repository: mesos
Updated Branches:
  refs/heads/master a4fd86bce -> be81a924a


Fixed a bug in getRootContainerId due to protobuf copying issue.

It looks like protobuf is not so great dealing with nesting messages
when doing merge or copy. This patch uses an extra copy to bypass that
issue in the protobuf.

Review: https://reviews.apache.org/r/51992


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/be81a924
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/be81a924
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/be81a924

Branch: refs/heads/master
Commit: be81a924a9e9414ec98f8c9a87a5391dad865146
Parents: a4fd86b
Author: Jie Yu 
Authored: Sat Sep 17 14:22:31 2016 -0700
Committer: Jie Yu 
Committed: Sat Sep 17 14:25:33 2016 -0700

--
 src/slave/containerizer/mesos/utils.hpp | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/mesos/blob/be81a924/src/slave/containerizer/mesos/utils.hpp
--
diff --git a/src/slave/containerizer/mesos/utils.hpp
b/src/slave/containerizer/mesos/utils.hpp
index 2bb55c1..178ebf3 100644
--- a/src/slave/containerizer/mesos/utils.hpp
+++ b/src/slave/containerizer/mesos/utils.hpp
@@ -27,7 +27,12 @@ static ContainerID getRootContainerId(const
ContainerID& containerId)
 {
   ContainerID rootContainerId = containerId;
   while (rootContainerId.has_parent()) {
-rootContainerId = rootContainerId.parent();
+// NOTE: Looks like protobuf does not handle copying well when
+// nesting message is involved, because the source and the target
+// point to the same object. Therefore, we create a temporary
+// variable and use an extra copy here.
+ContainerID id = rootContainerId.parent();
+rootContainerId = id;
   }

   return rootContainerId;


Re: Rate-limiting agent removal w/ PARTITION_AWARE

2016-07-30 Thread Neil Conway
Hi Ben,

Thanks for the feedback! Seems like we're on the same page overall.

On Thu, Jul 28, 2016 at 8:42 AM, Benjamin Mahler  wrote:
> It seems to me that these particular flags are not applicable for
> PARTITION_AWARE frameworks, since there is no removal occurring.

FWIW, I've still been using the term "removal" in the PARTITION_AWARE
branch to describe any situation in which a slave is removed from the
set of registered agents in the registry: e.g., both when we mark a
slave unreachable (move from "admitted" to "unreachable" list in the
registry) and when a slave gracefully disconnects via
UnregisterSlaveMessage (remove from "admitted" list).

> If we want to support schedulers that react poorly, we can add
> per-framework rate limits for unreachable notifications. Operators could
> turn these on to deal with specific frameworks that react poorly.

Sounds reasonable. I'm inclined to not implement this until we have
some evidence that people actually need it.

> In situations where the
> agent is considered unreachable, we won't offer resources, correct?

Correct.

Thanks,
Neil


Rate-limiting agent removal w/ PARTITION_AWARE

2016-07-27 Thread Neil Conway
Hi folks,

There are two "safety limits" in place that control the master's agent
removal behavior:

(1) "--agent_removal_rate_limit" controls the rate at which agents can
be removed from the cluster when they fail health checks.

(2) "--recovery_agent_removal_limit" controls the fraction of agents
in the cluster that can be removed if they fail to reregister within
"--agent_reregister_timeout" after a master failover. If this fraction
is exceeded, the master aborts without removing any agents. If the
fraction is *not* exceeded, any agents that have not reregistered will
be removed at a rate controlled by "--agent_removal_rate_limit", if
one is specified.
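
The decision in (2) can be sketched as below. This is a simplified illustration, not the master's actual code: `onReregisterTimeout` and `RecoveryAction` are hypothetical names, and the limit is expressed as a fraction between 0.0 and 1.0, which is an assumption about representation rather than the flag's real syntax.

```cpp
#include <cassert>
#include <cstddef>

enum class RecoveryAction
{
  REMOVE_UNREGISTERED_AGENTS,  // Proceed, subject to any removal rate limit.
  ABORT_MASTER                 // Too many agents missing: remove none.
};

// Hypothetical sketch of the check described in (2). `limit` is the allowed
// fraction (0.0 to 1.0) of recovered agents that may be removed.
inline RecoveryAction onReregisterTimeout(
    std::size_t recoveredAgents,
    std::size_t reregisteredAgents,
    double limit)
{
  assert(reregisteredAgents <= recoveredAgents);

  if (recoveredAgents == 0) {
    return RecoveryAction::REMOVE_UNREGISTERED_AGENTS;
  }

  const std::size_t missing = recoveredAgents - reregisteredAgents;
  const double fraction = static_cast<double>(missing) / recoveredAgents;

  return fraction > limit ? RecoveryAction::ABORT_MASTER
                          : RecoveryAction::REMOVE_UNREGISTERED_AGENTS;
}
```

For example, with 10 recovered agents, 4 reregistered and a limit of 0.5, the missing fraction is 0.6 and the sketch returns `ABORT_MASTER`.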

In the PARTITION_AWARE world [1], what kind of limits are appropriate?
To begin with, let's assume that all frameworks have opted in to the
PARTITION_AWARE capability.

(1) I'd argue that we no longer want this rate limit in the master: it
should be up to frameworks to decide how to deal with tasks running on
unreachable agents. If a given framework wants to use a rate-limit in
their logic for handling unreachable tasks, that is up to them. If we
applied a rate-limit "upstream" of the frameworks, we are restricting
their ability to define their own partition-handling policies. This
also means applying the same rate-limit to all tasks and all
frameworks, which is undesirable.

(2) This seems less clear to me, but I think you can also make a case
for removing this safety limit as well: in the PARTITION_AWARE world,
removing an agent from the cluster just means that frameworks will be
notified that the master can't communicate with the agent. If the
master fails over and, say, 60% of the agents in the cluster are not
reachable after the "--agent_reregister_timeout" expires, you could
argue that the master should just propagate that information to
frameworks. Typically you'd want an operator to take action in this
situation, but operator involvement can/should be triggered via
orthogonal means (e.g., monitoring the # of removed agents).

I'm curious to hear what people think about this behavior.

In any case, for Mesos < 2, we can't assume that all frameworks will
be PARTITION_AWARE. When an agent is removed and then reregisters,
non-PARTITION_AWARE tasks running on the agent will be shut down (but
the agent itself will continue running). In principle, it would be
nice to limit the rate at which *tasks* are killed: this would
mean the rate-limit would be ignored by PARTITION_AWARE frameworks,
while still having an effect for old-style frameworks. However, this
seems fairly complex.

I'm inclined to say that for Mesos 1.1, we should document that
PARTITION_AWARE frameworks probably don't want
"--agent_removal_rate_limit" to be configured; we can then consider
removing "--agent_removal_rate_limit" (and maybe
"--recovery_agent_removal_limit") for Mesos 2.0.

Comments welcome.

Neil

[1] https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ/edit?usp=sharing


Re: Registering and framework failover

2016-07-13 Thread Neil Conway
Ah, right -- yes, at the moment you need to look at error strings to
decide whether to retry with a new framework ID, unfortunately. IMO we
should introduce error codes or enums to make this process more
reliable, but no one has done so yet:

https://issues.apache.org/jira/browse/MESOS-4548
https://issues.apache.org/jira/browse/MESOS-5322
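
Until error codes exist, the string matching could look roughly like the
sketch below. Both helper names are hypothetical and not part of the Mesos
scheduler API; the matched text is the message quoted later in this thread,
and it may differ between Mesos versions, which is exactly why this approach
is fragile.

```cpp
#include <string>

// Hypothetical helper for a scheduler's error() callback. It relies on the
// wording of the error message, which is not a stable interface.
inline bool frameworkWasRemoved(const std::string& errorMessage)
{
  return errorMessage.find("Framework has been removed") !=
         std::string::npos;
}

// Returns the framework ID to use for the next registration attempt: the
// previous ID is kept unless the error says it can no longer be used, in
// which case an empty ID is returned so the framework registers as new.
inline std::string nextFrameworkId(
    const std::string& previousId,
    const std::string& errorMessage)
{
  return frameworkWasRemoved(errorMessage) ? std::string() : previousId;
}
```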

Neil


On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <ben...@yandex-team.ru> wrote:
> Let me try to clarify:
>
> The problem is that I don't get to decide manually if the framework
> should try to take a new id or re-use the old one, but it needs to be
> decided programmatically, by an algorithm.
>
> Afaik it's not possible to get the time when the framework disconnected
> from mesos, so it's not possible to know how much time is left until the
> failover timeout runs out. Therefore, if I want to attempt task
> reconciliation, I just have to try registering with my old framework id
> and see what happens.
>
> However, in the case where the failover timeout already passed, I now
> need to programmatically detect this error and try again with an empty
> framework id.
>
> My question was, is it possible to do this?
>
> (also, we actually use a failover timeout of 1 week, but it doesn't
> really change the problem and I mistakenly assumed that an example with
> smaller values would be more intuitive)
>
> On 13.07.2016 14:50, Neil Conway wrote:
>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <ben...@yandex-team.ru> wrote:
>>> imagine the following situation: I am a framework with failover timeout
>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>>> register with the master again.
>>>
>>> If my registration attempt arrives at the master within the time limit
>>> everything will be fine and I even get back the old tasks for
>>> reconciliation, but if it arrives slightly later the framework id is
>>> permanently blocked by mesos, and I am not able to register. Instead, I
>>> will receive an error()-callback with the message "Framework has been
>>> removed".
>>
>> Right: if you set a failover_timeout of 1 hour, your framework is
>> expected to reregister within one hour. If it does not, all of its
>> tasks will be killed and you need to start over with a new
>> FrameworkID. Can you clarify which aspect of this behavior is
>> problematic for you?
>>
>> Note that a failover_timeout of 1 hour is probably a little low.
>>
>>> Is there any way to reliably connect to the master while also
>>> reconciling old tasks if possible?
>>
>> Sorry, not sure what you mean by this.
>>
>> Neil
>>


Re: Registering and framework failover

2016-07-13 Thread Neil Conway
On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno  wrote:
> imagine the following situation: I am a framework with failover timeout
> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
> register with the master again.
>
> If my registration attempt arrives at the master within the time limit
> everything will be fine and I even get back the old tasks for
> reconciliation, but if it arrives slightly later the framework id is
> permanently blocked by mesos, and I am not able to register. Instead, I
> will receive an error()-callback with the message "Framework has been
> removed".

Right: if you set a failover_timeout of 1 hour, your framework is
expected to reregister within one hour. If it does not, all of its
tasks will be killed and you need to start over with a new
FrameworkID. Can you clarify which aspect of this behavior is
problematic for you?

Note that a failover_timeout of 1 hour is probably a little low.

> Is there any way to reliably connect to the master while also
> reconciling old tasks if possible?

Sorry, not sure what you mean by this.

Neil


Re: Disabling the --registry_strict flag in 1.0

2016-07-13 Thread Neil Conway
Hi Jie,

On Wed, Jul 13, 2016 at 12:11 AM, Jie Yu  wrote:
> Does this mean that we'll have to cut another 1.0 RC just for that?

I'd think so.

> If we were to cut another RC (e.g., due to bugs, which is likely), I would
> be happy to include the patch that disables the flag in the new RC. But if
> not, personally, I don't think this is necessary.

I agree it isn't necessary, but I think the cost of cutting another RC
is fairly low. Since we are confident that we don't want framework
authors to rely on the behavior enabled by the registry-strict flag, I
think taking active steps to prevent that is worth the cost.

> As far as I know, no one is setting this flag to true. Can we instead, just
> document that this flag should not be set?

That's another option, although if we want to make the docs and
"--help" output shipped with 1.0 consistent with the website, it would
also require code changes. Given that disabling the flag is a trivial
change, I'd still opt for that over a documentation-only change.

Neil


Re: getting added to contributors

2016-07-13 Thread Neil Conway
Hi Artem,

Sounds good -- one or the other seems fine, but requiring both seems a
bit onerous.

Neil

On Tue, Jul 12, 2016 at 9:14 PM, Artem Harutyunyan <ar...@mesosphere.io> wrote:
> Hi Neil,
>
> It's either that or an email (which is what the 'traditional' way is). The
> reason for requiring this step is that ASF mandates that contributors are
> added to the project manually, to prevent spambots from being able to
> assign tickets to themselves.
>
> So just to clarify, one can either choose to send an email and ask to be
> added to project contributors group in JIRA (no changes here), or they can
> instead submit a PR to contributors.yaml file (just specifying the email
> and JIRA handle should be sufficient) which will result in the same thing.
>
> We will update contribution guidelines to make this explicit.
>
> Artem.
>
> On Tue, Jul 12, 2016 at 1:27 AM, Neil Conway <neil.con...@gmail.com> wrote:
>
>> Do we really want everyone who wants to be assigned a JIRA to also add
>> themselves to the YAML file? To me, this adds another step to a
>> contribution process that probably has too many steps already.
>>
>> Neil
>>
>> On Mon, Jul 11, 2016 at 7:31 PM, Vinod Kone <vinodk...@apache.org> wrote:
>> > Welcome to the community!
>> >
>> > Mind sending a PR to add yourself to
>> > https://github.com/apache/mesos/blob/master/docs/contributors.yaml ?
>> >
>> > On Mon, Jul 11, 2016 at 10:28 AM, Lawrence Wu
>> <lawren...@twitter.com.invalid
>> >> wrote:
>> >
>> >> Hi, I will be working on
>> https://issues.apache.org/jira/browse/MESOS-5376.
>> >> idownes already added me as a contributor but I'm sending this email
>> just
>> >> for reference.
>> >>
>>


Disabling the --registry_strict flag in 1.0

2016-07-12 Thread Neil Conway
Hi folks,

I'd like to propose that we disable the --registry_strict flag for
1.0. You can find the rationale for this change here:

https://issues.apache.org/jira/browse/MESOS-5833

Please let me know if you have any thoughts on whether we should make
this change.

Thanks,
Neil


Re: getting added to contributors

2016-07-12 Thread Neil Conway
Do we really want everyone who wants to be assigned a JIRA to also add
themselves to the YAML file? To me, this adds another step to a
contribution process that probably has too many steps already.

Neil

On Mon, Jul 11, 2016 at 7:31 PM, Vinod Kone  wrote:
> Welcome to the community!
>
> Mind sending a PR to add yourself to
> https://github.com/apache/mesos/blob/master/docs/contributors.yaml ?
>
> On Mon, Jul 11, 2016 at 10:28 AM, Lawrence Wu > wrote:
>
>> Hi, I will be working on https://issues.apache.org/jira/browse/MESOS-5376.
>> idownes already added me as a contributor but I'm sending this email just
>> for reference.
>>


RFC: partitioned tasks and the strict registry

2016-07-11 Thread Neil Conway
Folks,

We're working on some Mesos features that will allow frameworks to
control how partitioned tasks are handled [1]. As part of designing
how this will work, I'd love to hear from users and framework
developers about how they handle partitioned tasks/agents. Specifically:

(a) Have you enabled the strict registry? ('--registry_strict' master flag)

(b) If so, do any of your frameworks _depend_ on the semantics
provided by the strict registry? [2]

(c) Does your framework handle LOST tasks? For example, does your
framework account for the fact that LOST tasks might transition back
to RUNNING in certain circumstances?

(d) Suppose we changed the semantics of LOST in the following way: (1)
strict registry is no longer supported, and (2) LOST tasks will
*always* be allowed to reregister with the master and resume running
(even if the master has not failed over). Would this change cause
problems for any of your frameworks?

Answering "I don't know" to any of these questions is fine :) Feel
free to respond to me privately if you'd prefer.

If you have any other feedback or questions, please contact me.

Thanks!

Neil

[1] More information on the proposed changes can be found here:
https://goo.gl/7dRw4Q

[2] e.g., your framework assumes that LOST tasks will never go back to RUNNING.


Re: [3/4] mesos git commit: Added filtering for orphaned tasks in /state endpoint.

2016-07-06 Thread Neil Conway
On Wed, Jul 6, 2016 at 12:06 AM,   wrote:
> diff --git a/src/master/http.cpp b/src/master/http.cpp
> index 6b4f85b..debedd4 100644
> --- a/src/master/http.cpp
> +++ b/src/master/http.cpp
> @@ -2498,11 +2498,8 @@ Future Master::Http::state(
>  });
>
>  // Model all of the orphan tasks.
> -// TODO(vinod): Need to filter these tasks based on authorization! This
> -// is currently not possible because we don't have `FrameworkInfo` for
> -// these tasks. We need to either store `FrameworkInfo` for orphan
> -// tasks or persist FrameworkInfo of all frameworks in the registry.
> -writer->field("orphan_tasks", [this](JSON::ArrayWriter* writer) {
> +writer->field("orphan_tasks", [this, ](
> +JSON::ArrayWriter* writer) {
>// Find those orphan tasks.
>foreachvalue (const Slave* slave, master->slaves.registered) {
>  typedef hashmap TaskMap;
> @@ -2511,6 +2508,16 @@ Future Master::Http::state(
>  CHECK_NOTNULL(task);
>  if (!master->frameworks.registered.contains(
>  task->framework_id())) {
> +  CHECK(master->frameworks.recovered.contains(
> +  task->framework_id()));

This CHECK seems dubious: what if the orphaned task was running on an
old version of the agent? i.e., a mixed cluster in which the master
has been updated but the agent has not been.

Neil


Re: Overloading and function names

2016-07-04 Thread Neil Conway
On Sun, Jul 3, 2016 at 9:10 PM, Benjamin Mahler  wrote:
> To clarify, are you ok with the removeSlave example? It seems to fit your
> criteria.

I think `removeSlave` is poorly named, for similar reasons -- I just
talked about `update` in my email for brevity.

> Usually with this kind of email we need concrete suggestions for
> improvement.

My email included the following, which I think is pretty concrete:

***
I'd like to propose that we avoid naming functions in this style: if
two functions do fundamentally different things or should be invoked
in very different circumstances, we should avoid giving them the same
name. We should use overloading when two variants of a function do
basically the same thing but differ slightly in the parameters they
accept.
***

We can certainly debate the specifics of whether/how to rename
particular functions, but I think the bigger question is whether we
want to endorse using overloading to differentiate between functions
that do fundamentally different things.

> With this I don't think it's as bad as you've described in terms of being
> able to intuit behavior:
>
> sorter->update(client, weight);
> sorter->update(slaveId, newTotal);
> sorter->update(client, slaveId, oldAllocation, newAllocation);
>
> That being said, if there are more helpful function names let's make some
> suggestions! The obvious alternative here seems to be verbose names that
> repeat the arguments?
>
> sorter->update_client_weight(client, weight);
> sorter->update_slave_total(slave, total);
> sorter->update_client_allocation(client, slaveId, oldAllocation,
> newAllocation);
>
> We tend to avoid this pattern as well, because it leads to redundancy.

I would happily accept a little more redundancy for these examples,
because I think it improves the clarity of the code. We tend to favor
clarity and redundancy over brevity in several other situations (e.g.,
variable names must be entire words, using `load` and `store` for
atomics rather than operator overloading).

For the specific examples above, I think the longer names (e.g.,
`update_client_weight`) are an improvement. There's also a compromise
(adding back the fourth variant of `update` which is a private
function):

sorter->updateWeight(client, weight);
sorter->updateTotal(slaveId, total);
sorter->updateAllocation(client, slaveId, oldAllocation, newAllocation);
sorter->updateShare(client);
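
To make the compromise concrete, here is a toy C++ sketch of what such an
interface might look like. `ToySorter` is a hypothetical stand-in, not the
real sorter: resources are modelled as a single double, and the private
`updateShare` variant is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for the sorter, used only to illustrate the naming scheme.
class ToySorter
{
public:
  // Set the weight associated with a client.
  void updateWeight(const std::string& client, double weight)
  {
    assert(weight > 0.0);
    weights[client] = weight;
  }

  // Change the total resources contributed by one agent.
  void updateTotal(const std::string& slaveId, double total)
  {
    totals[slaveId] = total;
  }

  // Replace a client's allocation on one agent.
  void updateAllocation(
      const std::string& client,
      const std::string& slaveId,
      double oldAllocation,
      double newAllocation)
  {
    allocations[client][slaveId] += newAllocation - oldAllocation;
  }

  // Weight of a client; clients without an explicit weight default to 1.0.
  double weight(const std::string& client) const
  {
    auto it = weights.find(client);
    return it == weights.end() ? 1.0 : it->second;
  }

  // Sum of the totals over all agents.
  double total() const
  {
    double sum = 0.0;
    for (const auto& entry : totals) {
      sum += entry.second;
    }
    return sum;
  }

  // Sum of a client's allocations over all agents.
  double allocation(const std::string& client) const
  {
    double sum = 0.0;
    auto it = allocations.find(client);
    if (it != allocations.end()) {
      for (const auto& entry : it->second) {
        sum += entry.second;
      }
    }
    return sum;
  }

private:
  std::map<std::string, double> weights;
  std::map<std::string, double> totals;
  std::map<std::string, std::map<std::string, double>> allocations;
};
```

Each call site now says which quantity it changes, without the reader
having to infer it from the argument types.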

Neil


Overloading and function names

2016-07-01 Thread Neil Conway
Consider the following function signatures from master.cpp:

Nothing Master::removeSlave(const Registry::Slave& slave);

void Master::removeSlave(Slave* slave, const string& message,
Option reason);

or these from sorter/drf/sorter.hpp:

void update(const SlaveID& slaveId, const Resources& resources);

void update(
  const std::string& name,
  const SlaveID& slaveId,
  const Resources& oldAllocation,
  const Resources& newAllocation);

void update(const std::string& name);

void update(const std::string& name, double weight);

These function names use overloading but the different variants have
*completely* different semantics. For example, the variants of
update() do the following:

(1) Set the weight associated with a role.
(2) Change the total pool of resources in the sorter.
(3) Update the fair-share associated with a sorter client.
(4) Change both the total and allocated resources in the sorter.

(For fun, the descriptions and function names are in different orders. :) )

I'd like to propose that we avoid naming functions in this style: if
two functions do fundamentally different things or should be invoked
in very different circumstances, we should avoid giving them the same
name. We should use overloading when two variants of a function do
basically the same thing but differ slightly in the parameters they
accept.

Comments welcome.

Neil


Re: source code compile failure mesos-0.28.0

2016-06-21 Thread Neil Conway
Can you post the content of "config.log"?

Thanks,
Neil

On Tue, Jun 21, 2016 at 3:17 PM, Ali Aktar  wrote:
> Hi;
>
> All dependencies as per doc were installed. I’m using Centos 7:
> Linux ip-172-31-46-249.eu-west-1.compute.internal 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> CentOS Linux release 7.2.1511 (Core)
>
>
> Thanks
> Ali.
>
> On 21 Jun 2016, at 13:55, José Guilherme Vanz  wrote:
>
>> Do you ensure that all dependencies described here:
>> http://mesos.apache.org/documentation/latest/getting-started/ are
>> installed? Furthermore, which Linux flavour are you using?
>>
>> Thanks
>>
>> On Tue, 21 Jun 2016 at 09:22 Ali Aktar  wrote:
>>
>>> Hi;
>>>
>>> Can someone please help, I’m getting the following errors while running
>>> ./configure:
>>>
>>> [root@ip-172-31-46-249 build]# ../configure --prefix=/mnt/s3mnt/mesos
>>> --exec-prefix=/mnt/s3mnt/mesos --datadir=/mnt/s3mnt/mesos
>>> checking build system type... x86_64-unknown-linux-gnu
>>> checking host system type... x86_64-unknown-linux-gnu
>>> checking target system type... x86_64-unknown-linux-gnu
>>> checking for a BSD-compatible install... /bin/install -c
>>> checking whether build environment is sane... yes
>>> checking for a thread-safe mkdir -p... /bin/mkdir -p
>>> checking for gawk... gawk
>>> checking whether make sets $(MAKE)... yes
>>> checking whether make supports nested variables... yes
>>> checking whether to enable maintainer-specific portions of Makefiles... yes
>>> checking for style of include used by make... GNU
>>> checking for gcc... gcc
>>> checking whether the C compiler works... yes
>>> checking for C compiler default output file name... a.out
>>> checking for suffix of executables...
>>> checking whether we are cross compiling... no
>>> checking for suffix of object files... o
>>> checking whether we are using the GNU C compiler... yes
>>> checking whether gcc accepts -g... yes
>>> checking for gcc option to accept ISO C89... none needed
>>> checking whether gcc understands -c and -o together... yes
>>> checking dependency style of gcc... gcc3
>>> checking for ar... ar
>>> checking the archiver (ar) interface... ar
>>> checking how to print strings... printf
>>> checking for a sed that does not truncate output... /bin/sed
>>> checking for grep that handles long lines and -e... /bin/grep
>>> checking for egrep... /bin/grep -E
>>> checking for fgrep... /bin/grep -F
>>> checking for ld used by gcc... /bin/ld
>>> checking if the linker (/bin/ld) is GNU ld... yes
>>> checking for BSD- or MS-compatible name lister (nm)... /bin/nm -B
>>> checking the name lister (/bin/nm -B) interface... BSD nm
>>> checking whether ln -s works... yes
>>> checking the maximum length of command line arguments... 1572864
>>> checking how to convert x86_64-unknown-linux-gnu file names to
>>> x86_64-unknown-linux-gnu format... func_convert_file_noop
>>> checking how to convert x86_64-unknown-linux-gnu file names to toolchain
>>> format... func_convert_file_noop
>>> checking for /bin/ld option to reload object files... -r
>>> checking for objdump... objdump
>>> checking how to recognize dependent libraries... pass_all
>>> checking for dlltool... no
>>> checking how to associate runtime and link libraries... printf %s\n
>>> checking for g++... g++
>>> checking whether we are using the GNU C++ compiler... yes
>>> checking whether g++ accepts -g... yes
>>> checking dependency style of g++... gcc3
>>> checking for archiver @FILE support... @
>>> checking for strip... strip
>>> checking for ranlib... ranlib
>>> checking command to parse /bin/nm -B output from gcc object... ok
>>> checking for sysroot... no
>>> checking for a working dd... /bin/dd
>>> checking how to truncate binary pipes... /bin/dd bs=4096 count=1
>>> checking for mt... no
>>> checking if : is a manifest tool... no
>>> checking how to run the C preprocessor... gcc -E
>>> checking for ANSI C header files... yes
>>> checking for sys/types.h... yes
>>> checking for sys/stat.h... yes
>>> checking for stdlib.h... yes
>>> checking for string.h... yes
>>> checking for memory.h... yes
>>> checking for strings.h... yes
>>> checking for inttypes.h... yes
>>> checking for stdint.h... yes
>>> checking for unistd.h... yes
>>> checking for dlfcn.h... yes
>>> checking for objdir... .libs
>>> checking if gcc supports -fno-rtti -fno-exceptions... no
>>> checking for gcc option to produce PIC... -fPIC -DPIC
>>> checking if gcc PIC flag -fPIC -DPIC works... yes
>>> checking if gcc static flag -static works... no
>>> checking if gcc supports -c -o file.o... yes
>>> checking if gcc supports -c -o file.o... (cached) yes
>>> checking whether the gcc linker (/bin/ld -m elf_x86_64) supports shared
>>> libraries... yes
>>> checking whether -lc should be explicitly linked in... no
>>> checking dynamic linker characteristics... GNU/Linux ld.so
>>> checking how to hardcode library paths into 

Improving support for partitioned tasks

2016-06-20 Thread Neil Conway
Currently, Mesos implements a hardcoded policy for handling
partitioned agents and tasks:

* agents are deemed to be partitioned when they fail health checks
(~75 seconds by default)
* partitioned agents are removed from the cluster. Frameworks receive
TASK_LOST for all tasks running on the removed agent.
* when the agent reconnects, the master instructs it to shut down and
terminate all of its tasks.
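
The hardcoded policy above can be sketched as a tiny state machine. This is
a simplified model, not Mesos code: `ToyAgent`, `onPingTimeout`, and
`onReconnect` are hypothetical names, the number of missed health checks is
a parameter chosen for illustration, and resetting the counter after a
successful health check is omitted.

```cpp
#include <cassert>

// Hypothetical types for illustration only.
enum class AgentState { REGISTERED, REMOVED };
enum class MasterReply { ACCEPT, SHUTDOWN };

struct ToyAgent
{
  AgentState state = AgentState::REGISTERED;
  int missedPings = 0;
};

// Records one failed health check. Returns true if this failure caused the
// agent to be removed, which is the point at which frameworks would receive
// TASK_LOST for the agent's tasks.
inline bool onPingTimeout(ToyAgent& agent, int maxMissedPings)
{
  assert(maxMissedPings > 0);

  if (agent.state == AgentState::REMOVED) {
    return false; // Already removed; nothing more to do.
  }

  if (++agent.missedPings >= maxMissedPings) {
    agent.state = AgentState::REMOVED;
    return true;
  }

  return false;
}

// A removed agent that reconnects is told to shut down (terminating its
// tasks); an agent that was never removed is accepted as usual.
inline MasterReply onReconnect(const ToyAgent& agent)
{
  return agent.state == AgentState::REMOVED ? MasterReply::SHUTDOWN
                                            : MasterReply::ACCEPT;
}
```

The point of the proposal is that the last step is not something a
framework can influence today.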

This is problematic: framework authors would like to implement their
own partition-handling logic. To improve this situation, this design
doc proposes changing how the Mesos master handles partitions:

https://issues.apache.org/jira/browse/MESOS-5659

Feedback is very welcome!

Thanks,
Neil


Re: Master configuration in the registry

2016-06-10 Thread Neil Conway
Makes sense: arguably you could say that "quota" and "weights" are
part of the master's (mutable) "state", not its "configuration", which
is largely immutable.

Another distinction is that some configuration flags control behavior
that doesn't need to be consistent between master replicas (e.g.,
"--ip", "--port", "--advertise-ip", "--advertise-port", "--hostname",
"--hostname-lookup", "--quiet", "--log_dir", "--modules_dir",
"--work_dir", etc).

Neil

On Fri, Jun 10, 2016 at 3:52 AM, Benjamin Mahler  wrote:
> I'm curious to hear thoughts on the distinction between using flags and
> persisting in the registry for master configuration. This topic had come up
> in a discussion and our current choices are intuitive but the criteria were
> not immediately obvious.
>
> Two cases seem interesting to me:
>
> (1) Quota.
> (2) Weights.
>
> These are configuration, but we persist them in the registry. Why is that?
>
> My intuition is that they reflect the organizational aspects of the
> workloads that are running and so we expect administrators and (most
> importantly!!) tooling to view and modify these over time.
>
> Timeouts, work directories, etc, on the other hand, are rarely modified and
> require initial values. There are also sane defaults for these that will
> work for most users.
>
> Thought this might be helpful for others that may wonder about this. Let me
> know if there are any other important criteria that I've missed.
>
> Ben


Blog posts for 0.28.1, 0.28.2 releases?

2016-06-10 Thread Neil Conway
Folks,

It seems like https://mesos.apache.org/blog/ doesn't have blog posts
for the Mesos 0.28.1 or 0.28.2 releases. We generally try to have a
blog post for each release, right?

Neil


Re: WebUI authentication in 1.0.0-rc1

2016-06-08 Thread Neil Conway
On Wed, Jun 8, 2016 at 4:27 PM, Alexander Rojas  wrote:
> I think we should also think more thoroughly about the expected behaviour
> when we introduce new authorizable actions (and we most certainly will).
> Since things may break particularly if users set the `permissive` ACL field
> to false.
>
> Perhaps initially, if no ACL is given for the new action we print a warning
> message and behave as if the field had an ACL such as
>
> ```
> {
>   "principals": {"type": "ANY"}
>   "action":{"type": "ANY"}
> }
> ```

An ACL configuration that omits any rules for a particular action is
not an invalid way to configure the system. e.g., suppose we added the
"/teardown" endpoint in Mesos 1.1, along with the
"teardown_frameworks" ACL. A perfectly reasonable way to configure the
behavior "no one should be allowed to use the /teardown endpoint" is
an ACL configuration that has "permissive: false" and doesn't
otherwise mention "teardown_frameworks".
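
A simplified model of that fallback, as a C++ sketch: `ToyAcls` and
`isAuthorized` are hypothetical names, and the model only supports rules
that grant access (the real ACL format is richer). The idea it shows is
that a request matched by no rule falls back to the "permissive" setting.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, heavily simplified ACL configuration.
struct ToyAcls
{
  bool permissive = true;

  // Action name -> principals explicitly allowed to perform it.
  std::map<std::string, std::set<std::string>> allowed;
};

// A request is authorized if some rule grants it; if no rule matches, the
// outcome is decided by `permissive`.
inline bool isAuthorized(
    const ToyAcls& acls,
    const std::string& action,
    const std::string& principal)
{
  auto it = acls.allowed.find(action);
  if (it != acls.allowed.end() && it->second.count(principal) > 0) {
    return true;
  }

  return acls.permissive;
}
```

In this model, setting `permissive` to false and never mentioning
"teardown_frameworks" denies that action to everyone, which is the
configuration described above.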

The situation here is a little unusual, because we're introducing ACLs
for behavior that was previously not covered by the authorization
system, rather than new functionality. But overall, I think the
situation can be addressed by documenting the new behavior
*prominently* in the release notes / upgrade docs -- anyone upgrading
to a non-patch release should be reading that document anyway, and the
required changes will usually be straightforward.

Neil


Re: Does anyone know the MESOS-4675 is back-ported to 0.25?

2016-06-08 Thread Neil Conway
Done -- https://issues.apache.org/jira/browse/MESOS-5569

On Wed, Jun 8, 2016 at 2:43 PM, Vinod Kone <vinodk...@gmail.com> wrote:
> +1. Can you file a JIRA?
>
> @vinodkone
>
>> On Jun 8, 2016, at 7:58 AM, Neil Conway <neil.con...@gmail.com> wrote:
>>
>> It would be great to make this information more prominent on the
>> website, especially once 1.0.0 is released. For example, we could list
>> the supported releases on https://mesos.apache.org/downloads/, along
>> with a link to the versioning document.
>>
>> Neil
>>
>>
>>> On Tue, Jun 7, 2016 at 6:58 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>> As haosdent suggested, the list of branches is the easiest way to check
>>> which releases are supported.
>>>
>>> As for the EOL policy, every release is supported for 6 months. See
>>> https://github.com/apache/mesos/blob/master/docs/versioning.md
>>>
>>>> On Tue, Jun 7, 2016 at 12:48 PM, haosdent <haosd...@gmail.com> wrote:
>>>>
>>>> @tommy Currently only 0.26.x, 0.27.0, and 0.28.x are maintained. You can
>>>> check at https://github.com/apache/mesos/branches
>>>> Personally, I could post a simple patch that just disables systemd for 0.25,
>>>> but you would need to apply it manually. Is that OK with you?
>>>>
>>>>> On Wed, Jun 8, 2016 at 12:42 AM, tommy xiao <xia...@gmail.com> wrote:
>>>>>
>>>>> Vinod,
>>>>>
>>>>> How can I tell which versions are EOL? Do any docs cover this?
>>>>>
>>>>> 2016-06-07 23:24 GMT+08:00 Vinod Kone <vinodk...@apache.org>:
>>>>>
>>>>>> 0.25.x is EOL'ed and hence no longer supported.
>>>>>>
>>>>>> On Tue, Jun 7, 2016 at 10:51 AM, Chengwei Yang <
>>>>> chengwei.yang...@gmail.com
>>>>>> wrote:
>>>>>>
>>>>>>> Seems it hasn't been backported to 0.25.x so far.
>>>>>>>
>>>>>>>> On Tue, Jun 07, 2016 at 08:49:20PM +0800, tommy xiao wrote:
>>>>>>>> i need it on 0.25 feature.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Deshi Xiao
>>>>>>>> Twitter: xds2000
>>>>>>>> E-mail: xiaods(AT)gmail.com
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Chengwei
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Deshi Xiao
>>>>> Twitter: xds2000
>>>>> E-mail: xiaods(AT)gmail.com
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>


Re: Does anyone know the MESOS-4675 is back-ported to 0.25?

2016-06-08 Thread Neil Conway
It would be great to make this information more prominent on the
website, especially once 1.0.0 is released. For example, we could list
the supported releases on https://mesos.apache.org/downloads/, along
with a link to the versioning document.

Neil


On Tue, Jun 7, 2016 at 6:58 PM, Vinod Kone  wrote:
> As haosdent suggested, the list of branches is the easiest way to check
> which releases are supported.
>
> As for the EOL policy, every release is supported for 6 months. See
> https://github.com/apache/mesos/blob/master/docs/versioning.md
>
> On Tue, Jun 7, 2016 at 12:48 PM, haosdent  wrote:
>
>> @tommy Currently only 0.26.x, 0.27.0, and 0.28.x are maintained. You can
>> check at https://github.com/apache/mesos/branches
>> Personally, I could post a simple patch that just disables systemd for 0.25,
>> but you would need to apply it manually. Is that OK with you?
>>
>> On Wed, Jun 8, 2016 at 12:42 AM, tommy xiao  wrote:
>>
>> > Vinod,
>> >
>> > How can I tell which versions are EOL? Do any docs cover this?
>> >
>> > 2016-06-07 23:24 GMT+08:00 Vinod Kone :
>> >
>> > > 0.25.x is EOL'ed and hence no longer supported.
>> > >
>> > > On Tue, Jun 7, 2016 at 10:51 AM, Chengwei Yang <
>> > chengwei.yang...@gmail.com
>> > > >
>> > > wrote:
>> > >
>> > > > Seems it hasn't been backported to 0.25.x so far.
>> > > >
>> > > > On Tue, Jun 07, 2016 at 08:49:20PM +0800, tommy xiao wrote:
>> > > > > i need it on 0.25 feature.
>> > > > >
>> > > > > --
>> > > > > Deshi Xiao
>> > > > > Twitter: xds2000
>> > > > > E-mail: xiaods(AT)gmail.com
>> > > >
>> > > > --
>> > > > Thanks,
>> > > > Chengwei
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Deshi Xiao
>> > Twitter: xds2000
>> > E-mail: xiaods(AT)gmail.com
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>


Re: [1/2] mesos git commit: Added aufs provisioning backend.

2016-06-08 Thread Neil Conway
Can you update the documentation for this change, please?

Thanks,
Neil

On Tue, Jun 7, 2016 at 6:14 PM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 90871a48f -> e5358ed1c
>
>
> Added aufs provisioning backend.
>
> Review: https://reviews.apache.org/r/47396/
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e5358ed1
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e5358ed1
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e5358ed1
>
> Branch: refs/heads/master
> Commit: e5358ed1c132923d5fa357d1e337e037d1f29c8a
> Parents: ca09304
> Author: Shuai Lin 
> Authored: Mon Jun 6 18:05:15 2016 -0700
> Committer: Jie Yu 
> Committed: Tue Jun 7 09:14:22 2016 -0700
>
> --
>  src/Makefile.am |   2 +
>  .../containerizer/mesos/provisioner/backend.cpp |   9 +
>  .../mesos/provisioner/backends/aufs.cpp | 227 +++
>  .../mesos/provisioner/backends/aufs.hpp |  70 ++
>  .../containerizer/provisioner_backend_tests.cpp |  51 +
>  src/tests/environment.cpp   |  13 ++
>  6 files changed, 372 insertions(+)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e5358ed1/src/Makefile.am
> --
> diff --git a/src/Makefile.am b/src/Makefile.am
> index a08ea40..b02b901 100644
> --- a/src/Makefile.am
> +++ b/src/Makefile.am
> @@ -1001,6 +1001,7 @@ MESOS_LINUX_FILES = \
>slave/containerizer/mesos/isolators/filesystem/shared.cpp\
>slave/containerizer/mesos/isolators/namespaces/pid.cpp   \
>slave/containerizer/mesos/isolators/network/cni/cni.cpp  \
> +  slave/containerizer/mesos/provisioner/backends/aufs.cpp  \
>slave/containerizer/mesos/provisioner/backends/bind.cpp  \
>slave/containerizer/mesos/provisioner/backends/overlay.cpp
>
> @@ -1024,6 +1025,7 @@ MESOS_LINUX_FILES += \
>slave/containerizer/mesos/isolators/filesystem/shared.hpp\
>slave/containerizer/mesos/isolators/namespaces/pid.hpp   \
>slave/containerizer/mesos/isolators/network/cni/cni.hpp  \
> +  slave/containerizer/mesos/provisioner/backends/aufs.hpp  \
>slave/containerizer/mesos/provisioner/backends/bind.hpp  \
>slave/containerizer/mesos/provisioner/backends/overlay.hpp
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e5358ed1/src/slave/containerizer/mesos/provisioner/backend.cpp
> --
> diff --git a/src/slave/containerizer/mesos/provisioner/backend.cpp b/src/slave/containerizer/mesos/provisioner/backend.cpp
> index b2a20b7..93a2c3a 100644
> --- a/src/slave/containerizer/mesos/provisioner/backend.cpp
> +++ b/src/slave/containerizer/mesos/provisioner/backend.cpp
> @@ -25,6 +25,7 @@
>  #include "slave/containerizer/mesos/provisioner/backend.hpp"
>
>  #ifdef __linux__
> +#include "slave/containerizer/mesos/provisioner/backends/aufs.hpp"
>  #include "slave/containerizer/mesos/provisioner/backends/bind.hpp"
>  #endif
>  #include "slave/containerizer/mesos/provisioner/backends/copy.hpp"
> @@ -47,6 +48,14 @@ hashmap Backend::create(const Flags& flags)
>  #ifdef __linux__
>creators.put("bind", ::create);
>
> +  Try aufsSupported = fs::aufs::supported();
> +  if (aufsSupported.isError()) {
> +LOG(WARNING) << "Failed to check aufs availability: '"
> + << aufsSupported.error();
> +  } else if (aufsSupported.get()) {
> +creators.put("aufs", ::create);
> +  }
> +
>Try overlayfsSupported = fs::overlay::supported();
>if (overlayfsSupported.isError()) {
>  LOG(WARNING) << "Failed to check overlayfs availability: '"
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e5358ed1/src/slave/containerizer/mesos/provisioner/backends/aufs.cpp
> --
> diff --git a/src/slave/containerizer/mesos/provisioner/backends/aufs.cpp b/src/slave/containerizer/mesos/provisioner/backends/aufs.cpp
> new file mode 100644
> index 000..54c0057
> --- /dev/null
> +++ b/src/slave/containerizer/mesos/provisioner/backends/aufs.cpp
> @@ -0,0 +1,227 @@
> +// Licensed to the Apache Software Foundation (ASF) under one
> +// or more contributor license agreements.  See the NOTICE file
> +// distributed with this work for additional information
> +// regarding copyright ownership.  The ASF licenses this file
> +// to you under the Apache License, Version 2.0 (the
> +// 

Re: mesos git commit: Added documentation for access_sandboxes and access_mesos_logs acls.

2016-06-06 Thread Neil Conway
FYI, this commit should have included the changes produced by
re-running the `support/generate-endpoint-help.py` script.

Neil

On Wed, Jun 1, 2016 at 8:26 AM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 5263a6211 -> 53b5164bb
>
>
> Added documentation for access_sandboxes and access_mesos_logs acls.
>
> Modifies the file `acls.proto` to take into consideration the added
> authorization actions `access_sandboxes` and `access_mesos_logs`.
>
> Review: https://reviews.apache.org/r/48048/
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/53b5164b
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/53b5164b
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/53b5164b
>
> Branch: refs/heads/master
> Commit: 53b5164bb51ebe850dec5ab19b8382f5c4a59391
> Parents: 5263a62
> Author: Alexander Rojas 
> Authored: Tue May 31 23:20:50 2016 -0700
> Committer: Adam B 
> Committed: Tue May 31 23:24:55 2016 -0700
>
> --
>  docs/authorization.md |  2 ++
>  src/files/files.cpp   | 34 +++---
>  2 files changed, 33 insertions(+), 3 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/53b5164b/docs/authorization.md
> --
> diff --git a/docs/authorization.md b/docs/authorization.md
> index 0e58b9b..189b70d 100644
> --- a/docs/authorization.md
> +++ b/docs/authorization.md
> @@ -131,6 +131,8 @@ entries, each representing an authorizable action:
>  |`view_framework`|UNIX user of whom executors can be 
> viewed.|`Framework_Info` which can be viewed.|Filtering http endpoints.|
>  |`view_executor`|UNIX user of whom executors can be viewed.|`Executor_Info` 
> and `Framework_Info` which can be viewed.|Filtering http endpoints.|
>  |`view_task`|UNIX user of whom tasks can be viewed.|(`Task` or `Task_Info`) 
> and `Framework_Info` which can be viewed.|Filtering http endpoints.|
> +|`access_sandboxes`|Operator username.|Operating system user whose 
> executor/task sandboxes can be accessed.|Access task sandboxes.|
> +|`access_mesos_logs`|Operator username.|Implicitly given. A user should only 
> use types ANY and NONE to allow/deny access to the log.|Access Mesos logs.|
>
>  ### Examples
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/53b5164b/src/files/files.cpp
> --
> diff --git a/src/files/files.cpp b/src/files/files.cpp
> index 873664d..094a00c 100644
> --- a/src/files/files.cpp
> +++ b/src/files/files.cpp
> @@ -57,6 +57,7 @@
>  using namespace process;
>
>  using process::AUTHENTICATION;
> +using process::AUTHORIZATION;
>  using process::DESCRIPTION;
>  using process::HELP;
>  using process::TLDR;
> @@ -295,7 +296,16 @@ const string FilesProcess::BROWSE_HELP = HELP(
>  "Query parameters:",
>  "",
>  ">path=VALUE  The path of directory to browse."),
> -AUTHENTICATION(true));
> +AUTHENTICATION(true),
> +AUTHORIZATION(
> +"Browsing files requires that the request principal is ",
> +"authorized to do so for the target virtual file path.",
> +"",
> +"Authorizers may categorize different virtual paths into",
> +"different ACLs, e.g. logs in one and task sandboxes in",
> +"another.",
> +"",
> +"See authorization documentation for details."));
>
>
>  Future FilesProcess::authorize(
> @@ -409,7 +419,16 @@ const string FilesProcess::READ_HELP = HELP(
>  ">offset=VALUEValue added to base address to obtain "
>  "a second address",
>  ">length=VALUELength of file to read."),
> -AUTHENTICATION(true));
> +AUTHENTICATION(true),
> +AUTHORIZATION(
> +"Reading files requires that the request principal is ",
> +"authorized to do so for the target virtual file path.",
> +"",
> +"Authorizers may categorize different virtual paths into",
> +"different ACLs, e.g. logs in one and task sandboxes in",
> +"another.",
> +"",
> +"See authorization documentation for details."));
>
>
>  Future FilesProcess::read(
> @@ -585,7 +604,16 @@ const string FilesProcess::DOWNLOAD_HELP = HELP(
>  "Query parameters:",
>  "",
>  ">path=VALUE  The path of directory to browse."),
> -AUTHENTICATION(true));
> +AUTHENTICATION(true),
> +AUTHORIZATION(
> +"Downloading files requires that the request principal is ",
> +"authorized to do so for the target virtual file path.",
> +"",
> +"Authorizers may categorize different virtual paths into",
> +"different ACLs, e.g. 

Re: mesos git commit: Removed deprecated annotation for values in a protobuf enum.

2016-05-19 Thread Neil Conway
Do we need to be source-compatible with protobuf 2.5? If so, why?

Neil

On Wed, May 18, 2016 at 11:15 PM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master b7e50fe8b -> 4248b3c3a
>
>
> Removed deprecated annotation for values in a protobuf enum.
>
> Support for deprecated annotation for enums was added in protobuf 2.6.
> Since we should be compatible with 2.5, refrain from using the
> feature.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/4248b3c3
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/4248b3c3
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/4248b3c3
>
> Branch: refs/heads/master
> Commit: 4248b3c3a1cbfdf3bef2bebc401cca55407f2b87
> Parents: b7e50fe
> Author: Alexander Rukletsov 
> Authored: Wed May 18 23:14:47 2016 +0200
> Committer: Alexander Rukletsov 
> Committed: Wed May 18 23:14:47 2016 +0200
>
> --
>  include/mesos/authorizer/authorizer.proto | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/4248b3c3/include/mesos/authorizer/authorizer.proto
> --
> diff --git a/include/mesos/authorizer/authorizer.proto 
> b/include/mesos/authorizer/authorizer.proto
> index b0d9f79..911a227 100644
> --- a/include/mesos/authorizer/authorizer.proto
> +++ b/include/mesos/authorizer/authorizer.proto
> @@ -61,8 +61,8 @@ enum Action {
>// TODO(zhitao): Remove the following two actions at the end of
>// the deprecation cycle which started with 0.29. They will be
>// fully replaced by `UPDATE_QUOTA_WITH_ROLE`.
> -  SET_QUOTA_WITH_ROLE = 8 [deprecated = true];
> -  DESTROY_QUOTA_WITH_PRINCIPAL = 9 [deprecated = true];
> +  SET_QUOTA_WITH_ROLE = 8;  // [deprecated = true];
> +  DESTROY_QUOTA_WITH_PRINCIPAL = 9;  // [deprecated = true];
>
>UPDATE_WEIGHTS_WITH_ROLE = 10;
>GET_ENDPOINT_WITH_PATH = 11;
>


Re: mesos git commit: Updated quota endpoint help.

2016-05-18 Thread Neil Conway
When modifying the endpoint help text, we should remember to update
the generated help files (via support/generate-endpoint-help.py) --
the changes to both the input text and generated output files should
be included as part of the same commit.

Neil

On Wed, May 18, 2016 at 10:58 AM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master a7835f889 -> 9f63d95f3
>
>
> Updated quota endpoint help.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/9f63d95f
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/9f63d95f
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/9f63d95f
>
> Branch: refs/heads/master
> Commit: 9f63d95f3cac17c94a7aff57980478263c78f6ee
> Parents: a7835f8
> Author: Adam B 
> Authored: Wed May 18 01:56:57 2016 -0700
> Committer: Adam B 
> Committed: Wed May 18 01:57:52 2016 -0700
>
> --
>  src/master/http.cpp | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/9f63d95f/src/master/http.cpp
> --
> diff --git a/src/master/http.cpp b/src/master/http.cpp
> index c4ca343..5d73a1d 100644
> --- a/src/master/http.cpp
> +++ b/src/master/http.cpp
> @@ -1286,15 +1286,20 @@ string Master::Http::QUOTA_HELP()
>  {
>return HELP(
>  TLDR(
> -"Sets quota for a role."),
> +"Gets or updates quota for roles."),
>  DESCRIPTION(
> -"Returns 200 OK when the quota has been changed successfully.",
> +"Returns 200 OK when the quota was queried or updated successfully.",
>  "Returns 307 TEMPORARY_REDIRECT redirect to the leading master when",
>  "current master is not the leader.",
>  "Returns 503 SERVICE_UNAVAILABLE if the leading master cannot be",
>  "found.",
> +"GET: Returns the currently set quotas as JSON.",
> +"",
>  "POST: Validates the request body as JSON",
> -" and sets quota for a role."),
> +" and sets quota for a role.",
> +"",
> +"DELETE: Validates the request body as JSON",
> +" and removes quota for a role."),
>  AUTHENTICATION(true),
>  AUTHORIZATION(
>  "Using this endpoint to set a quota for a certain role requires 
> that",
>


Re: mesos website workgroup

2016-05-17 Thread Neil Conway
Count me in.

Thanks,
Neil

On Tue, May 17, 2016 at 7:54 AM, Tomek Janiszewski  wrote:
> Count me in.
>
> Tomek
>
> On Tue, 17.05.2016, 07:49, Abhishek Dasgupta <
> a10gu...@linux.vnet.ibm.com> wrote:
>
>> I would be very much interested. I have some front-end experience as
>> well. Please, include me too.
>>
>> On Tuesday 17 May 2016 02:32 AM, Vinod Kone wrote:
>> > Hi guys,
>> >
>> > Mesos website needs some love. It hasn't seen major changes for a while
>> now
>> > and there is no real maintainer for it.
>> >
>> > I'm proposing we start a work group for the folks who are interested in
>> > contributing to the website. Especially folks who have interest and
>> > experience in frontend development or at least have access to folks with
>> > that experience (maybe colleagues at your company).
>> >
>> > Since we are gearing up for a 1.0 release, it would be nice to do a
>> website
>> > refresh.
>> >
>> > Please reply to this email if you are interested.
>> >
>> > Thanks,
>> > Vinod
>> >
>>
>>


Re: Design doc for TASK_GONE

2016-05-12 Thread Neil Conway
Folks,

Based on some design feedback, I've modified the proposal for this feature.

Rather than introducing a new TASK_GONE status, the major change is to
_change_ the meaning of TASK_LOST so that it only identifies when a
task is definitely not running (either because it failed to launch or
because the master knows it has shut down). We'll introduce a new task
state, TASK_LOST_IN_PROGRESS, for the case where a task may or may not
be running (we've lost contact with the agent), but the master will
instruct the slave to shut down when it reconnects.
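
As a rough illustration of how a scheduler might tell the two states
apart, here is a minimal sketch. The enum is a hypothetical subset (the
real one lives in mesos.proto and has more values), and
`definitelyNotRunning` is a helper name invented for this example.

```cpp
#include <cassert>

// Hypothetical subset of task states under the revised proposal.
enum TaskState
{
  TASK_RUNNING,
  TASK_LOST,             // Proposed meaning: definitely not running.
  TASK_LOST_IN_PROGRESS  // Proposed: agent unreachable, task may still run.
};

// Returns true if a replacement task can be launched without any risk
// of two copies running at once. (Illustrative helper, not a Mesos API.)
bool definitelyNotRunning(TaskState state)
{
  switch (state) {
    case TASK_LOST:
      return true;
    case TASK_LOST_IN_PROGRESS:
      return false;
    case TASK_RUNNING:
      return false;
  }

  return false;
}
```

A framework that cannot tolerate duplicate tasks would wait on
TASK_LOST_IN_PROGRESS and only relaunch once it sees TASK_LOST.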

More details here:
https://docs.google.com/document/d/1D2mJnwuC1qlT_SJGspfj4MdAQXflESCqKANY0Pj4644

Neil


On Mon, May 9, 2016 at 2:37 PM, Neil Conway <neil.con...@gmail.com> wrote:
> Hi folks,
>
> To address some shortcomings and ambiguities in the TASK_LOST task
> state, I'd like to propose that we introduce a new task state,
> TASK_GONE. For more information, see the design doc:
>
> https://issues.apache.org/jira/browse/MESOS-5345
>
> Comments welcome!
>
> Neil


Re: mesos git commit: Fixed a head-of-line blocking bug in libevent SSL socket.

2016-05-12 Thread Neil Conway
Would it be possible to write a unit test that reproduces the original
problem? It should be pretty easy to repro, right?
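
To make the scenario concrete, here is a small model of it. This is not
libprocess code: `PendingSocket`, `acceptOne` and the two scenario
functions are names made up for the sketch, and a plain `ready` flag
stands in for a future completing its handshake.

```cpp
#include <cassert>
#include <deque>
#include <memory>

// Hypothetical stand-in for a socket future: `ready` becomes true once
// the SSL handshake (or downgrade) has finished.
struct PendingSocket
{
  int id;
  bool ready;
};

typedef std::shared_ptr<PendingSocket> SocketFuture;

// Models one iteration of an accept loop: returns the id of the socket
// at the head of the queue, or -1 if the caller would have to wait.
int acceptOne(std::deque<SocketFuture>& queue)
{
  if (queue.empty() || !queue.front()->ready) {
    return -1;
  }

  int id = queue.front()->id;
  queue.pop_front();
  return id;
}

// Old behaviour: sockets are enqueued as soon as they are accepted, so
// a slow handshake at the head blocks a finished one behind it.
int acceptWithEagerEnqueue()
{
  SocketFuture slow(new PendingSocket{1, false});
  SocketFuture fast(new PendingSocket{2, false});

  std::deque<SocketFuture> queue;
  queue.push_back(slow);
  queue.push_back(fast);

  fast->ready = true; // The second handshake completes first.

  return acceptOne(queue);
}

// New behaviour: a socket is enqueued only once its handshake is done.
int acceptWithEnqueueOnCompletion()
{
  SocketFuture slow(new PendingSocket{1, false});
  SocketFuture fast(new PendingSocket{2, false});

  std::deque<SocketFuture> queue;

  fast->ready = true;
  queue.push_back(fast); // What the completion callback would do.

  return acceptOne(queue);
}
```

A real test would presumably need a client that stalls mid-handshake
while a second client connects, and would then check that the second
accept completes.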

Neil

On Thu, May 12, 2016 at 1:50 AM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 95e670cd4 -> 28c085fca
>
>
> Fixed a head-of-line blocking bug in libevent SSL socket.
>
> Currently, the `accept_queue` is used to store connected sockets.
> However, we push socket futures into this queue *before* they
> complete the SSL handshake or are downgraded. This means that
> a future returned from the queue may remain pending. If the user
> writes a `Socket::accept` loop consuming accepted sockets they
> will experience head-of-line blocking while a slow handshake or
> downgrade is in progress.
>
> Review: https://reviews.apache.org/r/47192
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/28c085fc
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/28c085fc
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/28c085fc
>
> Branch: refs/heads/master
> Commit: 28c085fcad70de128d147b486264395318c4d1ec
> Parents: 95e670c
> Author: Benjamin Mahler 
> Authored: Tue May 10 15:33:13 2016 -0700
> Committer: Benjamin Mahler 
> Committed: Wed May 11 16:50:33 2016 -0700
>
> --
>  3rdparty/libprocess/src/libevent_ssl_socket.cpp | 17 -
>  3rdparty/libprocess/src/libevent_ssl_socket.hpp | 10 +-
>  2 files changed, 17 insertions(+), 10 deletions(-)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/28c085fc/3rdparty/libprocess/src/libevent_ssl_socket.cpp
> --
> diff --git a/3rdparty/libprocess/src/libevent_ssl_socket.cpp 
> b/3rdparty/libprocess/src/libevent_ssl_socket.cpp
> index b829e7d..2f844c2 100644
> --- a/3rdparty/libprocess/src/libevent_ssl_socket.cpp
> +++ b/3rdparty/libprocess/src/libevent_ssl_socket.cpp
> @@ -843,8 +843,9 @@ Future LibeventSSLSocketImpl::accept()
>// We explicitly specify the return type to avoid a type deduction
>// issue in some versions of clang. See MESOS-2943.
>return accept_queue.get()
> -.then([](const Future& future) -> Future {
> -  return future;
> +.then([](const Future& socket) -> Future {
> +  CHECK(!socket.isPending());
> +  return socket;
>  });
>  }
>
> @@ -920,9 +921,15 @@ void 
> LibeventSSLSocketImpl::accept_callback(AcceptRequest* request)
>  {
>CHECK(__in_event_loop__);
>
> -  // Enqueue a potential socket that we will set up SSL state for and
> -  // verify.
> -  accept_queue.put(request->promise.future());
> +  Queue accept_queue_ = accept_queue;
> +
> +  // After the socket is accepted, it must complete the SSL
> +  // handshake (or be downgraded to a regular socket) before
> +  // we put it in the queue of connected sockets.
> +  request->promise.future()
> +.onAny([accept_queue_](Future socket) mutable {
> +  accept_queue_.put(socket);
> +});
>
>// If we support downgrading the connection, first wait for this
>// socket to become readable. We will then MSG_PEEK it to test
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/28c085fc/3rdparty/libprocess/src/libevent_ssl_socket.hpp
> --
> diff --git a/3rdparty/libprocess/src/libevent_ssl_socket.hpp 
> b/3rdparty/libprocess/src/libevent_ssl_socket.hpp
> index e773fad..9b6ba64 100644
> --- a/3rdparty/libprocess/src/libevent_ssl_socket.hpp
> +++ b/3rdparty/libprocess/src/libevent_ssl_socket.hpp
> @@ -170,11 +170,11 @@ private:
>// event loop until it is destroyed.
>std::weak_ptr* event_loop_handle;
>
> -  // This queue stores buffered accepted sockets. 'Queue' is a thread
> -  // safe queue implementation, and the event loop pushes connected
> -  // sockets onto it, the 'accept()' call pops them off. We wrap these
> -  // sockets with futures so that we can pass errors through and chain
> -  // futures as well.
> +  // This queue stores accepted sockets that are considered connected
> +  // (either the SSL handshake has completed or the socket has been
> +  // downgraded). The 'accept()' call returns sockets from this queue.
> +  // We wrap the socket in a 'Future' so that we can pass failures or
> +  // discards through.
>Queue accept_queue;
>
>Option peer_hostname;
>


Re: mesos git commit: Replaced CHECK with CHECK_READY.

2016-05-10 Thread Neil Conway
Hi Ben,

Thanks for raising this!

My thinking for grouping the changes together in a single review is
basically what AlexR said: I agree with doing "one thing per patch",
but I felt like a header cleanup was sufficiently close to CHECK
cleanup that they could be grouped together. If there's consensus that
folks would rather see such changes separated, I'm happy to do that in
the future.

Re: the  change, my mistake -- although I confess I
didn't realize we have a strict "no dependence on transitive includes"
policy. Is that documented anywhere?

Anyway, RRs here:

https://reviews.apache.org/r/47112 avoids depending on transitively
included headers
https://reviews.apache.org/r/47113 replaces "using namespace process;"
with more fine-grained "using" statements
https://reviews.apache.org/r/47114 fixes angle bracket style.

Thanks,
Neil

On Sun, May 8, 2016 at 10:28 AM, Alex R <al...@apache.org> wrote:
> I agree that "atomic patches" (those that do one thing per patch) are a good
> thing because they simplify navigating history, blame, and bisect. But how
> to define that "one thing"? Some people would say, that a new feature is one
> thing, and if introducing a feature requires some refactoring, it should be
> done in the same patch, so that motivation for the refactoring is clear. On
> the other hand there will be folks who would say that putting all
> loosely-coupled changes like refactoring, tests, protobuf update,
> implementation into the same patch makes it hardly reviewable and
> complicates finding problems via blame / bisect.
>
> Let's take a step back and think, why we need to read commit history in the
> first place. Based on my experience, I see several cases:
>   * Searching which patch introduced a bug, a regression, or some peculiar
> behaviour.
>   * Learning how a feature was implemented and why.
>
> That's why it is important to separate functional changes from cleanups.
>
> However, do we still want to separate different sorts of cleanups as well or
> squash them together to avoid the churn? If I search for a bug and see a
> style fix patch, I simply skip. I'd rather prefer to have one single patch
> for all style fixes than a tiny patch for each type of cleanup.
>
> Regarding r/46827/, you are right that auditing the include and using
> sections was not the primary intention, but I considered it fine to include
> those changes since they were not making the code "worse" (modulo removing
>  which was a mistake).
>
> On 8 May 2016 at 02:38, Benjamin Mahler <bmah...@apache.org> wrote:
>>
>> Hm.. any reason that unrelated headers were touched and the using
>> statement was removed in this patch?
>>
>> My concern with mixing unrelated changes within a single patch is that
>> patches become less precise. If one reads the patch there is additional
>> overhead in understanding what is related to the goal of the change and what
>> is not. I know it's a small example here but I see value in being
>> disciplined about this regardless of patch size.
>>
>> The other concern is that the reviewer of this patch has to review these
>> two additional changes:
>>   1. How does the header audit look? Anything else need added or removed?
>>   2. How does the 'using' audit look? Anything else need added or removed?
>>
>> (1) and (2) could be done together in a single patch. As it turns out, the
>> header audit looks like it has a few issues, but I'm guessing the reviewer
>> glossed over it because the point of this change was CHECK_READY :)
>>   - was removed but CHECK_NONE and CHECK_SOME are used
>>   - is not present but LOG is used
>>   - is absent but Nothing is used
>>   - is absent but process::Process / process::wait /
>> process::terminate are used.
>>
>> Then for the 'using' audit, we now avoid pulling in all of the process::
>> namespace in favor of finer-grained using statements.
>>
>> On Mon, May 2, 2016 at 4:08 AM, <al...@apache.org> wrote:
>>>
>>> Repository: mesos
>>> Updated Branches:
>>>   refs/heads/master 78f6101cc -> 4f9040db6
>>>
>>>
>>> Replaced CHECK with CHECK_READY.
>>>
>>> Also removes some unused header includes.
>>>
>>> Review: https://reviews.apache.org/r/46827/
>>>
>>>
>>> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/4f9040db
>>> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/4f9040db
>>> Diff: http://git-wip-us.apache.org/rep

Design doc for TASK_GONE

2016-05-09 Thread Neil Conway
Hi folks,

To address some shortcomings and ambiguities in the TASK_LOST task
state, I'd like to propose that we introduce a new task state,
TASK_GONE. For more information, see the design doc:

https://issues.apache.org/jira/browse/MESOS-5345

Comments welcome!

Neil


Re: Design doc: ordered delivery in libprocess

2016-04-11 Thread Neil Conway
Hi all,

It turns out this design doesn't work. The reason is that we made the
following assumption: given a client like:

s1 = connect(); send(s1, "msg1"); close(s1);
s2 = connect(); send(s2, "msg2"); close(s2);

And a server like:

char buf[...];
while (true) {
  s = accept();
  recv(s, buf);
  close(s);
}

Our design required that the server-side socket corresponding to "s1"
would always be accepted _before_ the server-side socket for "s2".
That is, that we can rely on the order in which server-side sockets
are accept()'d to infer the order in which client-side sockets are
connect()'d, as long as the client only makes at most one connect() to
a given host at a given time.

It turns out that this isn't the case on either OSX or Linux. I
haven't dug too deeply into the reasons why, but one plausible
explanation is that the kernel is using multiple accept() queues and
dispatching inbound connections to a queue in some unpredictable way.

It seems that implementing ordered delivery will require adding
metadata to the libprocess socket-establishment protocol to give the
recipient enough information to determine which of the inbound
connections from a given libprocess instance is the "newest".
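
One possible shape for that metadata, purely as a sketch: the sender
attaches a per-peer sequence number to each new connection, and the
receiver remembers the highest one it has seen. `ConnectionTracker` and
`observe` are hypothetical names, and this is not the design from the
doc.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical receiver-side bookkeeping. Assumes each sender numbers
// its outbound connections to a given host with increasing integers.
class ConnectionTracker
{
public:
  // Records an inbound connection from `peer`. Returns true if it is
  // newer than every connection seen so far from that peer, and false
  // if a newer connection was already accepted.
  bool observe(const std::string& peer, uint64_t sequence)
  {
    std::map<std::string, uint64_t>::iterator it = newest.find(peer);

    if (it != newest.end() && sequence <= it->second) {
      return false;
    }

    newest[peer] = sequence;
    return true;
  }

private:
  // Highest sequence number accepted so far, keyed by peer.
  std::map<std::string, uint64_t> newest;
};
```

With this, if the socket for "s2" is accepted before the socket for
"s1", the receiver can at least detect that "s1" is the older
connection. What it should then do with messages on the older
connection is a separate question.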

Neil


On Fri, Mar 25, 2016 at 12:50 PM, Neil Conway <neil.con...@gmail.com> wrote:
> A few months ago, there was a dev list thread on whether libprocess
> should provide ordered delivery [1]. The consensus then was that
> libprocess doesn't provide ordered delivery in a few corner cases, but
> that we should fix that behavior to guarantee ordered (but unreliable)
> message delivery.
>
> Please see this design doc for a proposal of how to achieve this:
>
> https://docs.google.com/document/d/1oqv42-GYXfh0TsyXGCBhxakXw8ckXMCnEb0T-ydEyYM/edit#
>
> Comments welcome.
>
> Thanks,
> Neil
>
> [1] 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201511.mbox/%3CCAOW5sYaANyaRD-Mbk7E_VsWwF=xrtcvhszuvmunj4sq8jdt...@mail.gmail.com%3E


Re: [1/6] mesos git commit: Fixed a memory leak in the scheduler driver.

2016-03-30 Thread Neil Conway
On Wed, Mar 30, 2016 at 4:57 PM, Benjamin Mahler  wrote:
> Yikes! (3) being not true to me means that I needed non-local reasoning to
> determine the optionality.

Sorry: to clarify, I didn't mean "there is not always a latch" in the
code in question. I meant: "writing 'delete latch;' is not a good way
to imply that there is always a latch, because that code works fine if
latch is optional."

> int fd = -1;
> ...
> if (something) { fd = open(); }
> ...
> if (fd != -1) { close(fd); }
>
> Is there something fundamentally different about new/delete?

To me, `if (ptr != NULL) { delete ptr; }` is distracting, because the
fact that deleting a null pointer is well-defined is very well known.
So I immediately see that and wonder why there is obviously redundant
code. In contrast, close(-1) not having some subtle overloaded meaning
(like kill(-1) does) isn't as obvious.

> Generally we use options for optionality, however we've occasionally
> avoided Option in favor of T* out of convenience. For these Option
> variables masquerading as T* variables, it would be great to keep the
> optionality checks to help the reader intuit the optionality, or convert to
> something better than just naked deletes.

I definitely think there is value in distinguishing between optional
and non-optional pointer values, but I don't think no-op statements
like `if (ptr) { delete ptr; }` are the best way to do that. +1 on
using something better than just naked deletes!

Neil


Re: [1/6] mesos git commit: Fixed a memory leak in the scheduler driver.

2016-03-30 Thread Neil Conway
(3) is not true, though: as written, there may or may not be a latch
(the code works correctly either way). Using an operation
(delete) that works fine for null pointers as a way to imply that a
pointer is NOT null does not seem like the best arrangement.

As you say, there should be better ways to communicate to the reader
that `process` and `credential` are optional but `latch` is not. For
example:

* encoding this into the type system using Owned / Option / etc.
* CHECKs (e.g., CHECK that `latch` is not NULL).
* comments
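
A rough sketch of the first two options, using std::unique_ptr as a
stand-in for stout's Owned and a plain assert where Mesos code would
use a CHECK. `Driver`, `Credential` and `Latch` here are simplified
placeholders, not the real scheduler driver types.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified placeholder types for this example only.
struct Credential { std::string principal; };
struct Latch {};

class Driver
{
public:
  // `credential` is optional and may be left empty; `latch` is
  // mandatory and is always created.
  explicit Driver(std::unique_ptr<Credential> credential_ = nullptr)
    : credential(std::move(credential_)),
      latch(new Latch())
  {
    // States the invariant explicitly instead of implying it through
    // an unguarded delete.
    assert(latch != nullptr);
  }

  bool hasCredential() const { return credential != nullptr; }
  bool hasLatch() const { return latch != nullptr; }

  // No destructor body is needed: each unique_ptr releases its object
  // (or does nothing if empty), so there are no naked deletes.

private:
  std::unique_ptr<Credential> credential; // Optional.
  std::unique_ptr<Latch> latch;           // Always set.
};
```

The optionality then lives in the declarations and the constructor
rather than in the cleanup code.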

Neil

On Wed, Mar 30, 2016 at 3:24 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> I'm not sure the null check was in place for the safety of deletion so much
> as to make the code easier to reason about:
>
>   if (process != NULL) {
> terminate(process);
> wait(process);
> delete process;
>   }
>
>   if (credential != NULL) {
> delete credential;
>   }
>
>   delete latch;
>
> Here, (1) there's sometimes a process, (2) there's sometimes a credential,
> (3) there's always a latch. Without the null check for credential, the
> optionality seems less clear to the reader (more non-local reasoning
> required). Arguably we could use Owned or Options of pointers here, but in
> its current form I would opt to leave the NULL check in to help the reader.
>
> On Tue, Mar 29, 2016 at 4:39 PM, Klaus Ma <klaus1982...@hotmail.com> wrote:
>
>> +1
>>
>> Refer to this doc for the detail of deleting null:
>> http://www.cplusplus.com/reference/new/operator%20delete/
>>
>> Thanks
>> Klaus
>>
>> > On Mar 30, 2016, at 07:24, Neil Conway <neil.con...@gmail.com> wrote:
>> >
>> > On Tue, Mar 29, 2016 at 7:19 PM,  <vinodk...@apache.org> wrote:
>> >> --- a/src/sched/sched.cpp
>> >> +++ b/src/sched/sched.cpp
>> >> @@ -1808,6 +1808,10 @@ MesosSchedulerDriver::~MesosSchedulerDriver()
>> >> delete process;
>> >>   }
>> >>
>> >> +  if (credential != NULL) {
>> >> +delete credential;
>> >> +  }
>> >
>> > `delete` of a NULL pointer is safe, so I would vote for removing the
>> `if`.
>> >
>> > Neil
>>
>>


Re: [1/6] mesos git commit: Fixed a memory leak in the scheduler driver.

2016-03-29 Thread Neil Conway
On Tue, Mar 29, 2016 at 7:19 PM,   wrote:
> --- a/src/sched/sched.cpp
> +++ b/src/sched/sched.cpp
> @@ -1808,6 +1808,10 @@ MesosSchedulerDriver::~MesosSchedulerDriver()
>  delete process;
>}
>
> +  if (credential != NULL) {
> +delete credential;
> +  }

`delete` of a NULL pointer is safe, so I would vote for removing the `if`.
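
For reference, a minimal sketch of the behavior. `Credential` below is
a stand-in type that only counts destructor calls, and `release` is a
helper invented for the example.

```cpp
#include <cassert>

// Stand-in for the driver's `credential` member; it only counts how
// many times its destructor has run.
struct Credential
{
  static int destroyed;
  ~Credential() { ++destroyed; }
};

int Credential::destroyed = 0;

// Deletes `credential` without a NULL check and returns the number of
// destructor calls observed so far.
int release(Credential* credential)
{
  // The C++ standard defines `delete` on a null pointer as a no-op,
  // so no guard is needed here.
  delete credential;
  return Credential::destroyed;
}
```

Calling `release` with a null pointer leaves the counter untouched;
calling it with a real object bumps it by one.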

Neil


Design doc: ordered delivery in libprocess

2016-03-25 Thread Neil Conway
A few months ago, there was a dev list thread on whether libprocess
should provide ordered delivery [1]. The consensus then was that
libprocess doesn't provide ordered delivery in a few corner cases, but
that we should fix that behavior to guarantee ordered (but unreliable)
message delivery.

Please see this design doc for a proposal of how to achieve this:

https://docs.google.com/document/d/1oqv42-GYXfh0TsyXGCBhxakXw8ckXMCnEb0T-ydEyYM/edit#

Comments welcome.

Thanks,
Neil

[1] 
https://mail-archives.apache.org/mod_mbox/mesos-dev/201511.mbox/%3CCAOW5sYaANyaRD-Mbk7E_VsWwF=xrtcvhszuvmunj4sq8jdt...@mail.gmail.com%3E


Re: Looking for Shepherd for MESOS-5002

2016-03-22 Thread Neil Conway
Sure, I'd be happy to review the change.

Neil


On Tue, Mar 22, 2016 at 9:01 AM, Jie Yu  wrote:
> + Neil
>
> Neil is driving the documentation improvement in Mesos. Neil, do you have
> time for that? I can help commit the patch if you give a shipit.
>
> - Jie
>
> On Tue, Mar 22, 2016 at 8:45 AM, Jiří Šimša  wrote:
>
>> Hello,
>>
>> Can anyone please shepherd the following JIRA:
>>
>> https://issues.apache.org/jira/browse/MESOS-5002
>>
>> This issue reflects the recent renaming of Tachyon to Alluxio in Mesos'
>> documentation. Thanks.
>>
>> Best,
>>
>> --
>> Jiří Šimša
>>


Re: mesos git commit: Add 'name' field into NetworkInfo.

2016-03-10 Thread Neil Conway
Should we also update docs/networking-for-mesos-managed-containers.md?
It contains a version of the NetworkInfo message definition.

Neil

On Thu, Mar 10, 2016 at 11:05 AM,   wrote:
> Repository: mesos
> Updated Branches:
>   refs/heads/master 57a574fc9 -> 2a436e02f
>
>
> Add 'name' field into NetworkInfo.
>
> Review: https://reviews.apache.org/r/44004/
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/2a436e02
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/2a436e02
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/2a436e02
>
> Branch: refs/heads/master
> Commit: 2a436e02f7f475e2d7264c6a4b58dd557bfec883
> Parents: 57a574f
> Author: Qian Zhang 
> Authored: Thu Mar 10 11:04:59 2016 -0800
> Committer: Jie Yu 
> Committed: Thu Mar 10 11:04:59 2016 -0800
>
> --
>  include/mesos/mesos.proto| 5 +
>  include/mesos/v1/mesos.proto | 5 +
>  src/common/http.cpp  | 8 
>  3 files changed, 18 insertions(+)
> --
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/2a436e02/include/mesos/mesos.proto
> --
> diff --git a/include/mesos/mesos.proto b/include/mesos/mesos.proto
> index 3d22ec3..56d456a 100644
> --- a/include/mesos/mesos.proto
> +++ b/include/mesos/mesos.proto
> @@ -1581,6 +1581,11 @@ message NetworkInfo {
>// this field is filled in automatically with the Agent IP address.
>repeated IPAddress ip_addresses = 5;
>
> +  // Name of the network which will be used by network isolator to determine
> +  // the network that the container joins. It's up to the network isolator
> +  // to decide how to interpret this field.
> +  optional string name = 6;
> +
>// Specify IP address requirement. Set protocol to the desired value to
>// request the network isolator on the Agent to assign an IP address to the
>// container being launched. If a specific IP address is specified in
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/2a436e02/include/mesos/v1/mesos.proto
> --
> diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
> index 31960a5..4fba774 100644
> --- a/include/mesos/v1/mesos.proto
> +++ b/include/mesos/v1/mesos.proto
> @@ -1578,6 +1578,11 @@ message NetworkInfo {
>// this field is filled in automatically with the Agent IP address.
>repeated IPAddress ip_addresses = 5;
>
> +  // Name of the network which will be used by network isolator to determine
> +  // the network that the container joins. It's up to the network isolator
> +  // to decide how to interpret this field.
> +  optional string name = 6;
> +
>// Specify IP address requirement. Set protocol to the desired value to
>// request the network isolator on the Agent to assign an IP address to the
>// container being launched. If a specific IP address is specified in
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/2a436e02/src/common/http.cpp
> --
> diff --git a/src/common/http.cpp b/src/common/http.cpp
> index be8538f..3e92979 100644
> --- a/src/common/http.cpp
> +++ b/src/common/http.cpp
> @@ -203,6 +203,10 @@ JSON::Object model(const NetworkInfo& info)
>  object.values["ip_addresses"] = std::move(array);
>}
>
> +  if (info.has_name()) {
> +object.values["name"] = info.name();
> +  }
> +
>return object;
>  }
>
> @@ -528,6 +532,10 @@ void json(JSON::ObjectWriter* writer, const NetworkInfo& 
> info)
>}
>  });
>}
> +
> +  if (info.has_name()) {
> +writer->field("name", info.name());
> +  }
>  }
>
>
>


Re: Need CHANGELOG updates

2016-03-03 Thread Neil Conway
I sent https://reviews.apache.org/r/44348/ for the floating point math
changes; if you'd prefer a different format or more/less details, just
let me know.

Thanks,
Neil

On Thu, Mar 3, 2016 at 10:57 AM, Vinod Kone  wrote:
> Hi guys,
>
> The 0.28.0 release is currently blocked on the updates to CHANGELOG.
> Basically I'm looking for shepherds/owners of feature tickets to add a
> blurb in the CHANGELOG for their tickets.
>
> The big ticket items that went into 0.28.0 that I know of
>
> --> net_cls_isolator
> --> floating point math for resources
> --> unified containerizer
> --> executor api v1
>
> If there are other things that need to be called out specifically in the
> CHANGELOG, please reach out to me.
>
> Thanks,


Re: Making 'curl' a prerequisite for installing Mesos

2016-03-03 Thread Neil Conway
No objection to the additional dependency, but using 'curl'
instead of 'libcurl' seems unfortunate. Can you share some more
detailed information about the problems that have been encountered
using libcurl? e.g., was using the curl_multi_xxx() APIs explored?

Neil

On Thu, Mar 3, 2016 at 9:10 AM, Jie Yu  wrote:
> Hi,
>
> I am proposing making 'curl' a prerequisite when installing Mesos.
> Currently, we require 'libcurl' being present when installing Mesos
> (http://mesos.apache.org/gettingstarted/). However, we found that it does
> not compose well with our asynchronous runtime environment (i.e., it'll
> block the current worker thread).
>
> Recent work on URI fetcher uses 'curl' directly, instead of using 'libcurl'
> to fetch artifacts, because it composes well with our async runtime env.
> 'curl' is installed by default in most systems (e.g., OSX, centos, RHEL).
>
> So I am proposing adding 'curl' to our prerequisite list. Let me know if you
> have any concern on this. I'll update the Getting Started doc if you are OK
> with this change.
>
> Thanks,
> - Jie
>


Re: Discussion about upgrading 3rdparty libraries

2016-03-01 Thread Neil Conway
The prospect of downloading dependencies from "rando" locations is
concerning to me :)

Mesos can easily come to depend on implementation details of a
dependency that might change in a minor release. For example, a recent
change [1] depends on the connection retry logic in the Zk client
library in a fairly delicate way. I also wouldn't want users to
randomly upgrade to, say, protobuf 2.6.1 without it being thoroughly
tested. Increasing the support matrix of different users on different
platforms running arbitrarily different versions of third-party
dependencies doesn't seem like a net improvement to me.

My two cents: if Windows requires additional dependencies that we
aren't currently vendoring, I would personally opt for (a) vendoring
those additional dependencies and (b) ensuring that the vendored versions
we ship are modern enough to support all the platforms we care about.
Are there important use-cases that aren't supported by this scheme?

Neil

[1] https://github.com/apache/mesos/commit/c2d496ed430eaf7daee3e57edefa39c25af2aa43

On Tue, Mar 1, 2016 at 10:00 AM, Alex Clemmer
 wrote:
> I guess a tl;dr might be in order.
>
> Basically: the CMake build system already supports roping in tarballs
> from rando places on the filesystem or Internet, so I think it makes
> sense to rope them in at configure time, and so I'm proposing we
> re-appropriate the sophisticated tools we already have to do this for
> Windows, into a more general solution that is useful to other exotic
> platforms, rather than just Windows.
>
> As always, super interested to hear feedback, I'd love to know if I
> missed something.
>
> On Tue, Mar 1, 2016 at 9:58 AM, Alex Clemmer
>  wrote:
>> This is a great time to discuss the Mesos dependency channel story in
>> general, because it has had to evolve a bit to fit the requirements of
>> Windows, and some of the issues you describe are issues we had to
>> resolve (at least partially) to support the Windows integration work.
>>
>> More particularly, our problems are: first, Windows frequently
>> requires newer versions of dependencies (due to poor support of MSVC
>> 1900), so we have had to develop reasonably robust version-selection
>> mechanisms, so that Windows can get specific versions of different
>> packages. This means that the Mesos project does not have to evolve
>> the dependency support story in lock step, which in the long term may
>> actually be required, as some platforms (e.g., those run by
>> governmental organizations) are more conservative about what
>> dependencies are introduced on their clusters.
>>
>> Second, because Windows does not have a package manager, it became
>> necessary for the CMake build system to support actually hitting some
>> remote (possibly the internet) to rope in the tarballs of arbitrary
>> (and arbitrarily-versioned) dependencies that we normally expect to
>> already be installed (such as APR or cURL).
>>
>> This last point is actually more convenient than it seems. Our CMake
>> implementation recently[1][2] introduced a flag that lets you specify
>> something like `cmake .. -D3RDPARTY_DEPENDENCIES=/some/path/or/url`
>> and it will proactively look for tarballs in the location you give it
>> -- and that location can be either a path on your filesystem, or a
>> URI, like the 3rdparty remote in github[3] that is owned by the Mesos
>> community. From the "exotic platform" perspective this is great
>> because it makes it trivial for people building on (say) Windows to
>> upgrade to a version not supported by CMake:
>>
>> * Put a tarball of a new version somewhere on the filesystem. Say, we
>> decide to use glog 0.3.4 instead of 0.3.3, so we just put that tarball
>> for 0.3.4 in a well-known place in the filesystem.
>> * Update the version of glog in Versions.cmake.
>> * When you run cmake, just run `cmake ..
>> -D3RDPARTY_DEPENDENCIES=/my/fancy/3rdparty/path`
>> * Builds against new dep! Magic!
>>
>> Much of this was developed out of expediency, but going forward I
>> think a reasonable approach to dealing with the third-party channel
>> might be (and I would LOVE feedback on this):
>>
>> WORKFLOW THAT ASSUMES INTERNET ACCESS ON BUILD MACHINE:
>> * Clone a copy of mesos.
>> * (When we do a normal clone of Mesos, there are no tarballs in the
>> `3rdparty/` directory.)
>> * Run `bootstrap`.
>> * `mkdir build && cd build && cmake ..`. Part of the CMake
>> configuration step will be to `git clone` a copy of
>> `https://github.com/3rdparty/mesos-3rdparty`. (If you don't know, the
>> 3rdparty account is owned by the Mesos community, and the
>> `mesos-3rdparty` is where we store canonical copies of all our
>> third-party tarballs.)
>> * This dumps all the tarballs into a folder, `mesos-3rdparty`.
>> * We build against the tarballs we retrieved. Optionally you are
>> allowed to set the versions in `Versions.cmake` and mesos will "just
>> build" against those versions (as long as they are supported, and 

Re: [VOTE] Release Apache Mesos 0.27.2 (rc1)

2016-02-29 Thread Neil Conway
As described (briefly) in the release emails, 0.27.2, 0.26.1, 0.25.1,
and 0.24.2 contain a new feature: "reliable floating point for scalar
resources" (MESOS-4687).

To elaborate on that slightly, Mesos now only supports scalar resource
values with three decimal digits of precision (e.g., reserving "5.001
CPUs" for a task). As a result of this change, frameworks that do
their own resource math may see slightly different results;
furthermore, if any frameworks were trying to manage extremely
fine-grained resource values (> 3 decimal digits of precision), that
will no longer be supported.
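
A framework that wants to know whether it is affected could check its own resource values before submitting them. The following is a minimal sketch, not a Mesos API: the function name and the 1e-6 tolerance are assumptions chosen for illustration.

```cpp
#include <cmath>

// Hypothetical helper (not part of Mesos): returns true if `value` carries
// more than three decimal digits of precision, i.e. if rounding it to the
// nearest thousandth would change it noticeably.
inline bool exceedsSupportedPrecision(double value) {
  const double scaled = value * 1000.0;
  // The tolerance absorbs ordinary binary floating point noise; 1e-6 is an
  // assumed threshold, not a value taken from the Mesos source.
  return std::fabs(scaled - std::round(scaled)) > 1e-6;
}
```

With this check, a value such as 5.001 passes unchanged, while 2.5001 is flagged as needing more precision than three decimal digits.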

For more information, please see:

https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAOW5sYZJn5caBOwZyPV008JgL1F2FYFxL_bM5CtYA2PF2OG7Bw%40mail.gmail.com%3E
https://docs.google.com/document/d/14qLxjZsfIpfynbx0USLJR0GELSq8hdZJUWw6kaY_DXc/edit?usp=sharing
https://issues.apache.org/jira/browse/MESOS-4687

Neil


On Fri, Feb 26, 2016 at 8:54 PM, Michael Park  wrote:
> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 0.27.2.
>
>
> 0.27.2 includes the following:
> 
>
> MESOS-4693 - Variable shadowing in HookManager::slavePreLaunchDockerHook.
> MESOS-4711 - Race condition in libevent poll implementation causes crash.
> MESOS-4754 - The "executors" field is exposed under a backwards incompatible
> schema.
> MESOS-4687 - Implement reliable floating point for scalar resources.
>
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.2-rc1
> 
>
> The candidate for Mesos 0.27.2 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/0.27.2-rc1/mesos-0.27.2.tar.gz
>
> The tag to be voted on is 0.27.2-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.2-rc1
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.27.2-rc1/mesos-0.27.2.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.27.2-rc1/mesos-0.27.2.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1104
>
> Please vote on releasing this package as Apache Mesos 0.27.2!
>
> The vote is open until Wed Mar 2 23:59:59 PST 2016 and passes if a majority
> of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 0.27.2
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> MPark, Joris, Kapil


Re: Enable compiler optimization by default?

2016-02-18 Thread Neil Conway
Great! I created https://issues.apache.org/jira/browse/MESOS-4709 to
track this issue.

Neil

On Thu, Feb 18, 2016 at 12:43 AM, Jan Schlicht <j...@mesosphere.io> wrote:
> +1
>
> On Thu, Feb 18, 2016 at 2:34 AM, Klaus Ma <klaus1982...@gmail.com> wrote:
>
>> +1;
>>
>> So our CI will also be updated to use optimisation flags, right? We need to
>> highlight this in the upgrade document for our users; I have run into some
>> strange behaviour after changing the -O level.
>>
>> On Thu, Feb 18, 2016 at 8:51 AM James DeFelice <james.defel...@gmail.com>
>> wrote:
>>
>> > +1
>> > On Feb 17, 2016 7:24 PM, "Neil Conway" <neil.con...@gmail.com> wrote:
>> >
>> > > Hi folks,
>> > >
>> > > At present, Mesos defaults to compiling with "-O0"; to enable compiler
>> > > optimizations, the user needs to specify "--enable-optimize".
>> > >
>> > > I'd like to propose we change the default, for a few reasons:
>> > >
>> > > (1) The autoconf default for CFLAGS/CXXFLAGS is "-O2 -g". Anecdotally,
>> > > I think most software packages compile with a reasonable level of
>> > > optimizations enabled by default.
>> > >
>> > > (2) I think we should make the default configure flags appropriate for
>> > > end-users (rather than Mesos developers): developers will be familiar
>> > > enough with Mesos to tune the configure flags according to their own
>> > > preferences.
>> > >
>> > > (3) The performance consequences of not enabling compiler
>> > > optimizations can be pretty severe: 5x in a benchmark I just ran, and
>> > > we've seen between 2x and 30x (!) performance differences for some
>> > > real-world workloads.
>> > >
>> > > Neil
>> > >
>> >
>> --
>>
>> Regards,
>> 
>> Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
>> IBM Platform Development & Support, STG, IBM GCG
>> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>>
>
>
>
> --
> *Jan Schlicht*
> Distributed Systems Engineer, Mesosphere


Re: Enable compiler optimization by default?

2016-02-17 Thread Neil Conway
On Wed, Feb 17, 2016 at 5:07 PM, Zameer Manji  wrote:
> Can't this problem also be solved by distributing packages that have
> optimized binaries?

The individuals/organizations that build packaged versions of Mesos
should ensure that compiler optimizations are enabled -- but I don't
think this entirely solves the problem, as some portion of users will
use the source tarballs distributed by the Mesos project.

Neil


Enable compiler optimization by default?

2016-02-17 Thread Neil Conway
Hi folks,

At present, Mesos defaults to compiling with "-O0"; to enable compiler
optimizations, the user needs to specify "--enable-optimize".

I'd like to propose we change the default, for a few reasons:

(1) The autoconf default for CFLAGS/CXXFLAGS is "-O2 -g". Anecdotally,
I think most software packages compile with a reasonable level of
optimizations enabled by default.

(2) I think we should make the default configure flags appropriate for
end-users (rather than Mesos developers): developers will be familiar
enough with Mesos to tune the configure flags according to their own
preferences.

(3) The performance consequences of not enabling compiler
optimizations can be pretty severe: 5x in a benchmark I just ran, and
we've seen between 2x and 30x (!) performance differences for some
real-world workloads.

Neil


Re: [2/2] mesos git commit: Added documentation for labeled reserved resources.

2016-02-12 Thread Neil Conway
Hi Ben,

On Fri, Feb 12, 2016 at 2:34 AM, Benjamin Mahler  wrote:
> Any plans to support labels for static reservations?
>
> Are we intentionally not supporting ReservationInfo for static
> reservations? Or is this just outside of the initial scope?

Labels for static reservations are not currently supported because
`labels` is part of `ReservationInfo`, and the latter is not set for
static reservations.

Setting ReservationInfo for static reservations is
https://issues.apache.org/jira/browse/MESOS-3486 . I didn't take this
on right now, because there are some backward compatibility concerns
with making this change. It is also unclear if we want to continue
adding features to static reservations vs. continuing to enhance
dynamic reservations to the point at which they can replace static
reservations for most use cases.

Neil


Re: Shepherd for MESOS-3486

2016-02-12 Thread Neil Conway
Hi Michael,

Thanks for taking this on! Joris Van Remoortere will shepherd, but
please also include me in the review request.

Thanks,
Neil


On Thu, Feb 11, 2016 at 5:21 PM, Michael Browning
 wrote:
> Hello,
>
> I've picked a small issue off of the newbie issues stack to get started
> with:
>
> https://issues.apache.org/jira/browse/MESOS-3486
>
> ...and I've commented on the issue with a proposed solution, and am now
> looking for a shepherd on this one.
>
> Thank you,
> Michael


Precision of scalar resources

2016-02-12 Thread Neil Conway
tl;dr:

If you use resource values with more than three decimal digits of
precision (e.g., you are launching a task that uses 2.5001 CPUs),
please speak up!



Mesos uses floating point to represent scalar resource values, such as
the number of CPUs in a resource offer or dynamic reservation. The
master does resource math in floating point, which leads to a few
problems:

* due to roundoff error, frameworks can receive offers that have
unexpected resource values (e.g., MESOS-3990)
* various internal assertions in the master can fail due to roundoff
error (e.g., MESOS-3552).

In the long term, we can solve these problems by switching to a
fixed-point representation for scalar values. However, that will
require a long deprecation cycle.

In the short term, we should make floating point behavior more
reliable. To do that, I propose:

(1) Resource values will support AT MOST three decimal digits of
precision. Additional precision in resource values will be discarded
(via rounding).

(2) The master will internally use a fixed-point representation to
avoid unpredictable roundoff behavior.
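
To make the two points concrete, here is a rough sketch of what "round to three digits, then do the math on integers" could look like. The helper names are hypothetical and this is not the code in the design doc or the master; it only illustrates the idea.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration; these are not Mesos APIs.

// Converts a scalar resource value to an integer count of thousandths,
// rounding to the nearest thousandth (three decimal digits).
inline int64_t toFixed(double value) {
  return static_cast<int64_t>(std::llround(value * 1000.0));
}

// Converts a count of thousandths back to a floating point value.
inline double fromFixed(int64_t fixed) {
  return static_cast<double>(fixed) / 1000.0;
}

// Adds two scalar values by doing the arithmetic on integers, so that
// roundoff error cannot accumulate in the sum.
inline double addScalars(double left, double right) {
  return fromFixed(toFixed(left) + toFixed(right));
}

// Subtracts two scalar values in the same way.
inline double subtractScalars(double left, double right) {
  return fromFixed(toFixed(left) - toFixed(right));
}
```

Under this scheme `addScalars(0.1, 0.2)` compares equal to `0.3`, which plain `0.1 + 0.2` does not, and a request for 2.5001 CPUs is treated as 2.5 CPUs.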

For more details, please see the design doc here:
https://docs.google.com/document/d/14qLxjZsfIpfynbx0USLJR0GELSq8hdZJUWw6kaY_DXc
-- comments welcome!

Thanks,
Neil


Re: Questions about release process

2016-02-04 Thread Neil Conway
freebsd.hpp is missing from the 0.27 release tarball, presumably
because 3rdparty/libprocess/3rdparty/stout/include/Makefile.am was not
updated to account for it. I'll send an RR shortly.

Neil

On Thu, Feb 4, 2016 at 10:16 AM, haosdent  wrote:
> Hi, David. I can see your commit in 0.27
> https://github.com/apache/mesos/blob/0.27.0/3rdparty/libprocess/3rdparty/stout/include/stout/os/freebsd.hpp
> Why do you say it is missing?
>
> On Fri, Feb 5, 2016 at 2:11 AM, David Forsythe  wrote:
>
>> Hi,
>>
>> A few changes for FreeBSD support landed in this release (at least
>> partially), but they don't seem to have made it into CHANGELOG. More
>> concerning is that a file in stout seems to be missing[1].
>>
>> For CHANGELOG, I assume I did something wrong in JIRA (example ticket: [2]
>> -- if something is wrong there, please let me know), but I'm not sure what
>> do about getting new files included. What can I do to make sure that
>> included changes are complete in future releases?
>>
>> Thanks!
>>
>> [1] 3rdparty/libprocess/3rdparty/stout/include/stout/os/freebsd.hpp
>> [2] https://issues.apache.org/jira/browse/MESOS-1563
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang


Re: Version numbers in docs

2016-02-02 Thread Neil Conway
I agree we should remove this text after some period of time has
passed since that version of Mesos was released; it is quite
distracting.

The proper long-term fix is probably to have version-specific docs. So
all the documentation at

/documentation/v0.27/...

would implicitly discuss the features that are present in Mesos 0.27.
That would eliminate a lot of the need for version number references
in the documentation text itself.

Neil

On Tue, Feb 2, 2016 at 4:48 PM, Greg Mann  wrote:
> Hi all!
> In our docs, you find such sentences as:
>
> "Mesos 0.15.0 added support for framework authentication, and 0.19.0 added
> slave authentication."
>
> It's a minor point, but I wonder how long we should maintain such version
> numbers in the docs? In the example above, my feeling is that we are far
> enough past 0.19.0 that we could change this to read simply, "Mesos
> supports authentication of both frameworks and slaves".
>
> Thoughts?
>
> Cheers,
> Greg


Re: Follow up on the proposal for simulation tools for master and allocator

2016-01-21 Thread Neil Conway
Hi Zhitao,

There's a JIRA here:

https://issues.apache.org/jira/browse/MESOS-3855

A few people who are interested in simulation of Mesos have been
meeting periodically, although due to the holidays we haven't had a
meeting in a little bit. I'll make sure you're included in the next
meeting when we get it scheduled.

Thanks,
Neil

On Wed, Jan 20, 2016 at 11:14 AM, Zhitao Li  wrote:
> Hi,
>
> I saw a message from last year
> (http://www.mail-archive.com/dev%40mesos.apache.org/msg33342.html) about a
> proposal for simulation tools. Has it been formalized as a JIRA issue so
> interested parties can subscribe and contribute design ideas?
>
> Thanks.


Re: mesos git commit: Added recommendations for programming with persistent volumes.

2016-01-19 Thread Neil Conway
Hi Alex,

Good point. I added some docs for this behavior a few weeks ago:

https://github.com/apache/mesos/commit/6d0619e2e1fbf78411f881f431269539c7d24565

But that appears in a different doc page. You're probably right that
it is worth mentioning here as well -- I'll send a review shortly.

Neil

On Tue, Jan 19, 2016 at 12:48 AM, Alex R <ruklet...@gmail.com> wrote:
> One more caveat here is when there are multiple frameworks in the role: one
> framework may successfully reserve certain resources but they will be
> offered to another framework in the role. Do you think it's worth
> mentioning this use case in the doc?
>
> On 18 January 2016 at 23:30, <jo...@apache.org> wrote:
>
>> Repository: mesos
>> Updated Branches:
>>   refs/heads/master 12455d0d0 -> e2963966a
>>
>>
>> Added recommendations for programming with persistent volumes.
>>
>> Added recommendations for programming with persistent volumes.
>>
>> Review: https://reviews.apache.org/r/41952/
>>
>>
>> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e2963966
>> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e2963966
>> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e2963966
>>
>> Branch: refs/heads/master
>> Commit: e2963966acc5c2263849ef183c9ee57251102d0e
>> Parents: 12455d0
>> Author: Neil Conway <neil.con...@gmail.com>
>> Authored: Mon Jan 18 17:30:19 2016 -0500
>> Committer: Joris Van Remoortere <joris.van.remoort...@gmail.com>
>> Committed: Mon Jan 18 17:30:19 2016 -0500
>>
>> --
>>  docs/persistent-volume.md | 66 ++
>>  1 file changed, 66 insertions(+)
>> --
>>
>>
>>
>> http://git-wip-us.apache.org/repos/asf/mesos/blob/e2963966/docs/persistent-volume.md
>> --
>> diff --git a/docs/persistent-volume.md b/docs/persistent-volume.md
>> index f969975..4af7d6e 100644
>> --- a/docs/persistent-volume.md
>> +++ b/docs/persistent-volume.md
>> @@ -334,3 +334,69 @@ The user receives one of the following HTTP responses:
>>
>>  Note that a single `/destroy-volumes` request can destroy multiple
>> persistent
>>  volumes, but all of the volumes must be on the same slave.
>> +
>> +### Programming with Persistent Volumes
>> +
>> +Some suggestions to keep in mind when building applications that use
>> persistent
>> +volumes:
>> +
>> +* A single `acceptOffers` call can be used to both create a new dynamic
>> +  reservation (via `Offer::Operation::Reserve`) and create a new
>> persistent
>> +  volume on those newly reserved resources (via
>> `Offer::Operation::Create`).
>> +
>> +* Attempts to dynamically reserve resources or create persistent volumes
>> might
>> +  fail---for example, because the network message containing the
>> operation did
>> +  not reach the master or because the master rejected the operation.
>> +  Applications should be prepared to detect failures and correct for them
>> (e.g.,
>> +  by retrying the operation).
>> +
>> +* When using HTTP endpoints to reserve resources or create persistent
>> volumes,
>> +  _some_ failures can be detected by examining the HTTP response code
>> returned
>> +  to the client. However, it is still possible for a `200` response code
>> to be
>> +  returned to the client but for the associated operation to fail.
>> +
>> +* When using the scheduler API, detecting that a dynamic reservation has
>> failed
>> +  is a little tricky: reservations do not have unique identifiers, and
>> the Mesos
>> +  master does not provide explicit feedback on whether a reservation
>> request has
>> +  succeeded or failed. Hence, framework schedulers typically use a
>> combination
>> +  of two techniques:
>> +
>> +  1. They use timeouts to detect that a reservation request may have
>> failed
>> + (because they don't receive a resource offer containing the expected
>> + resources after a given period of time).
>> +
>> +  2. To check whether a resource offer includes the effect of a dynamic
>> + reservation, applications _cannot_ check for the presence of a
>> "reservation
>> + ID" or similar value (because reservations do not have IDs). Instead,
>> + applications should examine th

Re: Links in documentation

2016-01-14 Thread Neil Conway
On Thu, Jan 14, 2016 at 11:39 AM, Joris Van Remoortere
 wrote:
>> *In fact it seems that all links ending with .md are interpreted as
>> relative links on the webpage, i.e. [label](https://test.com/foo.md) is
>> rendered into <a href="https://test.com/foo/">label</a>.
>
> I think this should be fixed. We shouldn't be restricted from linking to
> external documentation.

I opened https://issues.apache.org/jira/browse/MESOS-4384 for this.

Neil


Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

2016-01-08 Thread Neil Conway
On Fri, Jan 8, 2016 at 12:29 PM, Benjamin Mahler  wrote:
> (2) It is difficult to reliably obtain cluster state through the existing
> endpoints. This one is less clear to me than the first problem. Here we
> have to think through how we want users to be hitting state endpoints. Do
> they hit all the masters and take the first valid response? Do they first
> ask for the leader, then query the leader? Both of these have races (in the
> first case the requests are not atomic, so you may receive two valid
> responses; in the second case the leader information may become
> stale before the second request). Do we add redirects? Even redirects have
> issues, there may be multiple redirects, there may be a redirect to a
> master that is unable to redirect further (and so we haven't really solved
> the race difficulties with redirects).

I believe the proposed behavior is:

* Clients can query any master
* Endpoint queries against a non-leading master result in redirects to
the current leader

If the client follows a redirect to a different master, it may get
redirected one or more times; it might also be unable to reach the
current leader, or the queried master might be unable to determine the
current leader. That seems like quite reasonable behavior to me,
though (and technically I would argue that these situations aren't
really "races" -- the client just needs to recognize that as in any
distributed system, the information it observes might be stale).
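
From the client's side, that behavior could be handled with a bounded redirect loop. The sketch below is only an illustration: the `Response` type, the `query` callback, and the use of 200 and 307 status codes are assumptions, not the actual Mesos endpoints or client library.

```cpp
#include <functional>
#include <string>

// Hypothetical response type for illustration; not a Mesos API.
struct Response {
  int status;            // Assumed: 200 for success, 307 for a redirect.
  std::string location;  // Redirect target; meaningful only for 307.
};

// Follows redirects starting at `master` until some master answers 200 or
// `maxRedirects` hops have been used. Returns the address of the master
// that answered, or an empty string if no leader could be reached.
inline std::string findLeader(
    std::string master,
    const std::function<Response(const std::string&)>& query,
    int maxRedirects) {
  for (int hops = 0; hops <= maxRedirects; ++hops) {
    const Response response = query(master);
    if (response.status == 200) {
      return master;
    }
    if (response.status != 307 || response.location.empty()) {
      // For example a 503: this master cannot determine the leader.
      return "";
    }
    master = response.location;
  }
  // Too many redirects; leadership may be changing right now.
  return "";
}
```

An empty result is not an error in the protocol; it just tells the caller to retry later, since whatever it observed may already be stale.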

We could alternatively introduce a "who-is-the-current-leader"
endpoint (which is something people have asked for [1]). As long as
non-leading masters notify clients that they aren't talking to a
leader (e.g., by returning a 403/503 error), that should also avoid
races.

Neil

[1] https://issues.apache.org/jira/browse/MESOS-3841


Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

2016-01-06 Thread Neil Conway
+1 -- I think we should make this change. The current behavior is
quite dangerous.

Neil

On Wed, Jan 6, 2016 at 12:52 PM, Diogo Gomes  wrote:
> Hi, Adam and Haosdent
>
>
> Resurrecting this issue, https://issues.apache.org/jira/browse/MESOS-1865, I
> would like to add a +1 for this change, which apparently went cold but I
> think is very relevant, and we have had enough time to prepare for a change
> like this, right?
>
>
> If necessary, can I help with something?
>
>
> Diogo Gomes
>
>
>
>


Re: No master is currently leading ...

2016-01-06 Thread Neil Conway
Hi,

Can you post the full logs from all of the master instances?

BTW, the @dev list is mostly intended for discussion around the
development of Mesos. The @user list is a better venue for user
support/configuration questions.

Thanks,
Neil


On Wed, Jan 6, 2016 at 12:58 PM, DiGiorgio, Mr. Rinaldo S.
 wrote:
> Hi,
>
> I just installed mesos 0.26 with zookeeper and without zookeeper.  In both 
> cases I am getting a new message that I have not seen before.
>
> No master is currently leading ...
>
> Did I miss something in the release notes about some additional configuration
> required?
>
> Rinaldo


Re: Mesos build & testing environment instructions

2015-12-17 Thread Neil Conway
+1 to the general idea of including this information in the documentation.

I'd probably lean towards including this information in the current
"Getting Started" page, but in a separate section ("Running The Test
Suite"?).

Neil

On Thu, Dec 17, 2015 at 12:38 PM, Greg Mann  wrote:
> Hey folks!
> Something occurred to me recently which is related to the extensive testing
> we did in preparation for the 0.26.0 release. Since I started contributing
> to the project, my Source of Truth for "how to prepare a given platform to
> compile and run Mesos" has been the Getting Started page of our
> documentation. However, this doc doesn't provide guidance for all platforms
> on "how to prepare this platform to compile Mesos and then TEST it in all
> configurations", which is crucial information for us when it comes to
> testing, and would be useful to our users as well. I wonder if it makes
> sense to have a separate place in our documentation where we include these
> exhaustive installation instructions, which may be beyond the scope of a
> "Getting Started" document for new users.
>
> One option is to introduce a new documentation section on testing, where we
> can include supplementary installation instructions, as well as information
> on the test suite, how to run it, the available options, etc. We already
> have a page on good patterns to use when *writing* tests, but nothing I can
> find on running the tests, besides a brief mention in "Getting Started".
>
> Another option would be to just expand our existing install instructions to
> be a bit more comprehensive and include instructions for optional
> components like libevent2, docker, kernel updates to enable cgroup tests,
> etc.  Especially on an older platform like CentOS 6.6, this can be tricky.
>
> Note that some of these installations (like libevent2 on CentOS 6.6)
> require the use of hard-to-find RPMs whose origin is uncertain, and it's
> possible that we wouldn't want to offer such instructions publicly to our
> users.
>
> Thoughts?
>
> Cheers,
> Greg


Re: Speed up Mesos tests

2015-12-16 Thread Neil Conway
+1 on the speed-up-the-tests project!

On Wed, Dec 16, 2015 at 10:29 AM, Greg Mann  wrote:
> I'd like to bring up something that both Neil and Joseph mentioned to me
> recently, which could be of use when working on these slow test tickets.
> Since we have the `process::Clock` class, it's quite easy to control the
> clock manually, and doing so can both speed up tests as well as make them
> more deterministic/less flaky. While we're working on the above tickets, I
> think it would be nice to look for opportunities to alter the tests we're
> touching to pause the clock and then advance it explicitly using `pause()`,
> `settle()`, and `advance()`, rather than letting it run as usual.

Yep -- I think eventually having the clock paused by default for tests
would probably be a good idea:

https://issues.apache.org/jira/browse/MESOS-4101

To make that happen, we might need a few more primitives to force
"pending" events to be processed before manually advancing the clock.
`Clock::settle()` works for libprocess messages, but not for socket
communication more generally (e.g., when using the HTTP API). It would
help to get rid of this kludge in `Clock::settle` as well:

https://issues.apache.org/jira/browse/MESOS-3760
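
For anyone unfamiliar with the approach, the reason a paused clock makes tests both faster and more deterministic can be seen in a toy model. This is a sketch only: `ManualClock` is a made-up class and deliberately much simpler than libprocess's `process::Clock`.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// A toy, manually driven clock for illustration; this is NOT the libprocess
// `process::Clock` API. Time only moves when the test says so.
class ManualClock {
public:
  // Schedules `callback` to run once the clock has advanced by `delayMs`.
  void delay(int64_t delayMs, std::function<void()> callback) {
    timers_.emplace(nowMs_ + delayMs, std::move(callback));
  }

  // Moves time forward and runs every timer that is now due, earliest first.
  void advance(int64_t ms) {
    nowMs_ += ms;
    while (!timers_.empty() && timers_.begin()->first <= nowMs_) {
      // Remove the timer before running it, so a callback may safely
      // schedule further timers.
      std::function<void()> callback = std::move(timers_.begin()->second);
      timers_.erase(timers_.begin());
      callback();
    }
  }

  int64_t now() const { return nowMs_; }

private:
  int64_t nowMs_ = 0;
  std::multimap<int64_t, std::function<void()>> timers_;
};
```

A test that waits on a five second timeout can call `advance(5000)` and finish immediately, and the timer fires at exactly the same logical instant on every run.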

Neil

