Re: Moving the website repo from svn to git

2017-06-01 Thread Vinod Kone
Tim, with the two-repo option, the idea is that the source of the website will 
still reside in the main repo even if we keep the published contents in a 
different repo. 

@vinodkone

> On Jun 1, 2017, at 8:42 PM, Timothy Anderegg wrote:
> 
> Just to chime in, I'm almost done with the changes to the website code that
> allow the user to select the version of documentation they wish to see
> (haosdent is reviewing the final revisions). It does depend on using
> git to check out the previous versions of the website via tags, so if we did
> isolate the website code to a specific branch or repo, we would also need
> to ensure that the tags of commits to the website code stay in sync with
> tags of commits to the actual code. This would not be too challenging, but
> it is something to keep in mind.
> 
> Keeping the website code in a separate repository might be easier to manage
> from this perspective, since tags are effectively global to a given repo,
> so if we kept the website code in a special branch within the main repo,
> we'd need something like a tag called "1.3.0" for the main code, and
> "website-1.3.0" for the website code, which could be confusing.
> 
>> On Thu, Jun 1, 2017 at 8:53 PM Vinod Kone  wrote:
>> 
>> Thanks for the analysis Benjamin. Really appreciate it.
>> 
>> You bring up good points esp about size bump for supporting multiple
>> versions.
>> 
>> Btw, do the numbers change if publish content is only in a branch ? Guess
>> not?
>> 
>> Maybe we can start with a separate git repo and see if it's painful enough
>> to merge it into our source repo.
>> 
>> @vinodkone
>> 
>>> On Jun 1, 2017, at 4:06 PM, Benjamin Bannier <benjamin.bann...@mesosphere.io> wrote:
>>> 
>>> Hi Vinod,
>>> 
>>>> *Implementation details: *
>>>>
>>>> We have an option to move to
>>>> 1) a standalone git repo (say "mesos-site") which will be mirrored on
>>>> github.
>>>> 2) just use our "mesos" git repo and publish a "asf-site" branch with
>>>> website contents (say at 'site/publish' directory)
>>>>
>>>> I'm leaning towards 2) because that allows us to deal with single repo
>>>> instead of two.
>>> 
>>> I have never updated the website so I cannot comment on the pain involved.
>>>
>>> As a user of the Mesos source git repository I would however like to bring up
>>> that _all_ of the website’s assets are generated from files present in the
>>> source repository (at some point in time). The largest fraction of the
>>> `publish` directory is Doxygen documentation (currently >90% at ~100 MB). We
>>> should weigh the effect this would have for developers should we add this
>>> content to the Mesos source repository.
>>>
>>> To get a ballpark idea I imported the website’s history into a git
>>> repository. After the initial import its `.git` directory contained ~100 MB
>>> which went down to ~30MB after aggressive repository repacking. A fresh clone
>>> of the Mesos source repository amounts to ~280 MB, so it seems we would add
>>> at least 10% to the repositories size with little benefit to developers.
>>> Depending on the implementation, this number would likely increase would we
>>> e.g., provide version-dependent website content, or introduce website asset
>>> formats not compressing as nicely with git (e.g., generated graphics).
>>>
>>> I have the feeling keeping this content in a separate repository might strike
>>> a better balance for developers.
>>> 
>>> 
>>> Benjamin
>>> 
>> 


Re: Moving the website repo from svn to git

2017-06-01 Thread Vinod Kone
Thanks for the analysis Benjamin. Really appreciate it. 

You bring up good points esp about size bump for supporting multiple versions. 

Btw, do the numbers change if the publish content is only in a branch? Guess not?

Maybe we can start with a separate git repo and see if it's painful enough to 
merge it into our source repo. 

@vinodkone

> On Jun 1, 2017, at 4:06 PM, Benjamin Bannier wrote:
> 
> Hi Vinod,
> 
>> *Implementation details: *
>> 
>> We have an option to move to
>> 1) a standalone git repo (say "mesos-site") which will be mirrored on
>> github.
>> 2) just use our "mesos" git repo and publish a "asf-site" branch with
>> website contents (say at 'site/publish' directory)
>> 
>> I'm leaning towards 2) because that allows us to deal with single repo
>> instead of two.
> 
> I have never updated the website so I cannot comment on the pain involved.
> 
> As a user of the Mesos source git repository I would however like to bring up 
> that _all_ of the website’s assets are generated from files present in the 
> source repository (at some point in time). The largest fraction of the 
> `publish` directory is Doxygen documentation (currently >90% at ~100 MB). We 
> should weigh the effect this would have for developers should we add this 
> content to the Mesos source repository.
> 
> To get a ballpark idea I imported the website’s history into a git 
> repository. After the initial import its `.git` directory contained ~100 MB 
> which went down to ~30MB after aggressive repository repacking. A fresh clone 
> of the Mesos source repository amounts to ~280 MB, so it seems we would add 
> at least 10% to the repositories size with little benefit to developers. 
> Depending on the implementation, this number would likely increase would we 
> e.g., provide version-dependent website content, or introduce website asset 
> formats not compressing as nicely with git (e.g., generated graphics).
> 
> I have the feeling keeping this content in a separate repository might strike 
> a better balance for developers.
> 
> 
> Benjamin
> 


Re: Moving the website repo from svn to git

2017-06-01 Thread Benjamin Bannier
Hi Vinod,

> *Implementation details: *
> 
> We have an option to move to
> 1) a standalone git repo (say "mesos-site") which will be mirrored on
> github.
> 2) just use our "mesos" git repo and publish a "asf-site" branch with
> website contents (say at 'site/publish' directory)
> 
> I'm leaning towards 2) because that allows us to deal with single repo
> instead of two.

I have never updated the website so I cannot comment on the pain involved.

As a user of the Mesos source git repository I would however like to bring up 
that _all_ of the website’s assets are generated from files present in the 
source repository (at some point in time). The largest fraction of the 
`publish` directory is Doxygen documentation (currently >90% at ~100 MB). We 
should weigh the effect this would have for developers should we add this 
content to the Mesos source repository.

To get a ballpark idea I imported the website’s history into a git repository. 
After the initial import its `.git` directory contained ~100 MB, which went down 
to ~30 MB after aggressive repository repacking. A fresh clone of the Mesos 
source repository amounts to ~280 MB, so it seems we would add at least 10% to 
the repository's size with little benefit to developers. Depending on the 
implementation, this number would likely increase if we were to, e.g., provide 
version-dependent website content, or introduce website asset formats that do 
not compress as nicely with git (e.g., generated graphics).
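
As a rough check of the estimate above, here is a minimal Python sketch that
recomputes the relative increase from the approximate figures quoted in this
message (~30 MB of repacked website history against a ~280 MB fresh clone).
The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the size estimate.
# Both figures are the approximate ones quoted in this message.
SOURCE_CLONE_MB = 280  # fresh clone of the Mesos source repository
SITE_HISTORY_MB = 30   # imported website history after aggressive repacking


def relative_increase(base_mb: float, added_mb: float) -> float:
    """Return the added size as a fraction of the base repository size."""
    return added_mb / base_mb


increase = relative_increase(SOURCE_CLONE_MB, SITE_HISTORY_MB)
print(f"website history adds about {increase:.1%} to a fresh clone")
```

Any version-dependent content or poorly compressing assets would only push
this fraction higher.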

I have the feeling keeping this content in a separate repository might strike a 
better balance for developers.


Benjamin



Re: Moving the website repo from svn to git

2017-06-01 Thread Benjamin Mahler
So is the extra branch only used to limit the frequency that the bot sees
new commits? Or to hide website publishing commits from the master branch?
Trying to understand why we wouldn't include the publish directory directly
in the master branch.

On Thu, Jun 1, 2017 at 2:48 PM, Vinod Kone  wrote:

> On Thu, Jun 1, 2017 at 2:24 PM, Benjamin Mahler 
> wrote:
>
> > Curious to know more about what 2) looks like.
> >
>
>
> We would be maintaining a remote branch called "asf-site", similar to how
> we maintain remote release branches (1.0.x etc) today. The only difference
> would be that this branch will also include our website publish contents
> (e.g., in mesos/site/publish directory).
>
> Workflow would be:
>
> 1) Whenever we want to update the website, we generate/update the publish
> directory and do a push to that remote branch.
>
> 2) An ASF bot (gitsvnpubsub) will automatically track changes to this
> branch and refresh the website at mesos.apache.org.
>
> We will automate 1) via Jenkins. Infra for 2) already exists; we just need
> to ask the ASF INFRA folks to have the bot track our branch.
>
> Let me know if that doesn't clear things up.
>


Questions about Mesos starting procedure in the source code

2017-06-01 Thread Wenzhao Zhang
Hi, All:

I just started working on the Mesos source code for a research project. I have
become confused about the startup procedure and need some help.
I'm talking about the procedure of using "mesos-execute" to
run a Docker image.
1. How are resources offered to the framework (Docker) from the master?
In Master::offer(), I find that a "ResourceOffersMessage" is sent.
Searching the source code, I find that only "mesos-1.2.0/src/sched/*sched.cpp*"
has a function to receive this message, and this function finally invokes a
scheduler driver to finish the task.
But I believe this is not the procedure by which resources are offered to
the Docker image, as I don't see any logic in
"mesos-1.2.0/src/cli/*execute.cpp*" using "*sched.cpp*"; and according to the
documentation, "Mesos provides a simple executor that can execute shell
commands and Docker containers on behalf of the framework scheduler".

   In "*execute.cpp*", I see an "offers()" function, which finally executes
some executors. But I don't see where this function is called from the
master.
   How does this simple executor execute shell commands and Docker
containers on behalf of the framework scheduler?
   How is "*sched.cpp*" used in the source code?


2. After "execute.cpp" subscribes to the master, "framework" information is
created in the master.
   But how is this "framework" info distributed to the slaves? I am
confused about this procedure.

Could anyone kindly give some suggestions?
Thanks very much

Wenzhao


Re: RFC: Partition Awareness

2017-06-01 Thread Vinod Kone
On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:

> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>

Adding to what Neil said, I think most (if not all) non-PA frameworks
would've already rescheduled the task after seeing a TASK_LOST. The
difference is that previously such tasks could come back to TASK_RUNNING iff
the master failed over and the non-strict registry (default) was used. Now, we
are saying tasks can come back to TASK_RUNNING irrespective of master
failover. The assumption/hope is that this shouldn't break existing frameworks
in a catastrophic way.


Re: Moving the website repo from svn to git

2017-06-01 Thread Vinod Kone
On Thu, Jun 1, 2017 at 2:24 PM, Benjamin Mahler  wrote:

> Curious to know more about what 2) looks like.
>


We would be maintaining a remote branch called "asf-site", similar to how
we maintain remote release branches (1.0.x etc) today. The only difference
would be that this branch will also include our website publish contents
(e.g., in mesos/site/publish directory).

Workflow would be:

1) Whenever we want to update the website, we generate/update the publish
directory and do a push to that remote branch.

2) An ASF bot (gitsvnpubsub) will automatically track changes to this
branch and refresh the website at mesos.apache.org.

We will automate 1) via Jenkins. Infra for 2) already exists; we just need
to ask the ASF INFRA folks to have the bot track our branch.
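
The publishing step 1) could be sketched roughly as follows. This is only an
illustration, not the actual Jenkins job: the function name
`publish_commands`, the remote name `origin`, and the commit message are
assumptions; the step that generates the site is deliberately left out since
it is not specified here; and the commands are only printed, not executed.

```python
import shlex
from typing import List


def publish_commands(publish_dir: str = "site/publish",
                     branch: str = "asf-site",
                     remote: str = "origin") -> List[List[str]]:
    """Return the git commands for publishing an already generated site.

    Hypothetical helper: the remote name and commit message are assumed,
    and generating the contents of publish_dir happens before this step.
    """
    return [
        ["git", "checkout", branch],               # switch to the site branch
        ["git", "add", publish_dir],               # stage the generated contents
        ["git", "commit", "-m", "Updated the website."],
        ["git", "push", remote, branch],           # the ASF bot tracks this branch
    ]


for command in publish_commands():
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(command))
```

Step 2) needs no code on our side, since the bot picks up the push to the
tracked branch.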

Let me know if that doesn't clear things up.


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-06-01 Thread Benjamin Mahler
+1 (binding)

Looks like ExamplesTest.DynamicReservationFramework is flaky; unfortunately I
wasn't able to get the logs for a failed run.

On Thu, Jun 1, 2017 at 2:03 PM, Benjamin Mahler  wrote:

> Not a blocker, but noticed the parallel test runner isn't bundled in the
> release, if you configure with '--enable-parallel-test-execution':
>
> /Users/bmahler/Downloads/mesos-1.3.0/support/mesos-gtest-runner.py
> --sequential=*ROOT_* ./stout-tests
> /bin/sh: /Users/bmahler/Downloads/mesos-1.3.0/support/mesos-gtest-runner.py:
> No such file or directory
>
> On Wed, May 31, 2017 at 1:48 PM, Vinod Kone  wrote:
>
>> Thanks for the triage.
>>
>> +1 (binding)
>>
>> On Wed, May 31, 2017 at 1:33 PM, Neil Conway 
>> wrote:
>>
>>> On Tue, May 30, 2017 at 3:43 PM, Neil Conway 
>>> wrote:
>>> > Attached is the test log for this failure. From a quick look, seems as
>>> > though the agent starts to launch the task, including forking the
>>> > child process, but no subsequent task status updates or error messages
>>> > are observed. Gaston, have you seen this before?
>>> >
>>> > I filed https://issues.apache.org/jira/browse/MESOS-7589 to track
>>> this.
>>>
>>> I wasn't able to repro this failure. Per Gaston's email, there isn't
>>> enough information in the logs to understand what is going on here,
>>> although it certainly seems weird that apparently the executor doesn't
>>> start.
>>>
>>> I think this doesn't justify blocking the release, but we should watch
>>> to see if the problem recurs.
>>>
>>> Neil
>>>
>>
>>
>


Re: RFC: Partition Awareness

2017-06-01 Thread Neil Conway
Hi Ben,

The argument for changing the semantics is that correct frameworks
should _always_ have accounted for the possibility that TASK_LOST
tasks would go back to running (due to the non-strict registry
semantics). The proposed change would just increase the probability of
this behavior occurring. From a certain POV, this change would
actually make it easier to write correct frameworks because the
TASK_LOST scenario will be less of a corner case :)
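
To make the point concrete, here is a minimal Python sketch of framework-side
bookkeeping that accounts for a TASK_LOST task going back to TASK_RUNNING. It
does not use the Mesos scheduler API; the class `TaskTracker`, its method, and
the returned action strings are hypothetical, and killing the resurrected task
(because a replacement was already requested) is just one possible policy.

```python
from typing import Dict


class TaskTracker:
    """Hypothetical bookkeeping for a framework that is not partition aware."""

    def __init__(self) -> None:
        # Last status seen for each task ID.
        self.last_state: Dict[str, str] = {}

    def on_status_update(self, task_id: str, state: str) -> str:
        """Record a status update and return the action the framework takes."""
        previous = self.last_state.get(task_id)
        self.last_state[task_id] = state

        if state == "TASK_LOST":
            # Treat the task as gone and launch a replacement elsewhere.
            return "reschedule"
        if state == "TASK_RUNNING" and previous == "TASK_LOST":
            # The agent re-registered and the task is still running. A
            # replacement was already requested, so (assumed policy) kill
            # this copy instead of treating the update as an error.
            return "kill_resurrected"
        return "none"


tracker = TaskTracker()
for update in ["TASK_RUNNING", "TASK_LOST", "TASK_RUNNING"]:
    print(update, "->", tracker.on_status_update("task-1", update))
```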

Implementing the task-killing behavior is a bit tricky, because the
task might continue to run on the agent for a considerable period of
time. During that time, we can either:

(a) omit the being-killed task from the master's memory (current
behavior). That means that any resources used by the task appear to be
unused, so there might be a concurrent task launch that attempts to
use them and fails.

(b) track the being-killed task in the master's memory. This ensures
the task's resources are not re-offered until the task is actually
terminated. The concern here is that this "being-killed" task is in a
weird state -- what task status should it have? When it finally dies,
we don't want to report a terminal status update back to frameworks
(for backward compatibility).

Neither of those approaches seemed ideal, hence we are wondering
whether we really need to implement this backward compatibility
behavior in the first place.

Neil

On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:
> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>
> It seems like we should just directly fix the issue, do you have a sense of
> what the difficulty is there? Is it the re-use of the existing framework
> shutdown message to kill the tasks that makes this problematic?
>
> On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:
>>
>> Hi All,
>>
>> We are working on fixing a potential issue MESOS-7215 with partition
>> awareness which happens when an unreachable agent, with tasks for
>> non-Partition Aware frameworks, attempts to re-register with the master.
>> Before the support for partition-aware frameworks, which was introduced in
>> Mesos 1.1.0 MESOS-5344,  if an agent partitioned from the master attempted
>> to re-register, then it will be shut down and all the tasks on the agent
>> would be terminated. With this feature, the partitioned agents were no
>> longer shut down by the master when they re-registered but to keep the old
>> behavior the tasks on these agents were still shutdown if the corresponding
>> framework didn’t opt-in to partition awareness.
>>
>> One of the possible solutions to address the issue mentioned in MESOS-7215
>> is to change master’s behavior to not kill the tasks for non-Partition aware
>> frameworks when an unreachable agent re-registers with the master. When an
>> agent goes unreachable i.e. fails the masters health check ping for
>> max_agent_ping_timeouts then the master sends TASK_LOST status updates for
>> all the tasks on this agent which have been launched by non-Partition Aware
>> frameworks. So, if such tasks are no longer killed by the master then upon
>> agent re-registration the frameworks will see a non-terminal status updates
>> for tasks for which they already received a TASK_LOST.
>> This change will hopefully not break any schedulers since it could have
>> happened in the past with non-strict registry as well and schedulers are
>> expected to be resilient enough to handle this scenario.
>>
>> For the proposed solution we wanted to get feedback from the community to
>> ensure that this change doesn’t break or cause any side effects for the
>> schedulers. Looking forward to any feedbacks/comments.
>>
>> Many Thanks
>> Megha
>>
>>
>


Re: RFC: Partition Awareness

2017-06-01 Thread Benjamin Mahler
If I understood correctly, the proposal is to not kill the tasks for
non-partition aware frameworks? That seems like a pretty big change for
frameworks that are not partition aware and expect the old killing
semantics.

It seems like we should just directly fix the issue, do you have a sense of
what the difficulty is there? Is it the re-use of the existing framework
shutdown message to kill the tasks that makes this problematic?

On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:

> Hi All,
>
> We are working on fixing a potential issue MESOS-7215 with partition
> awareness which happens when an unreachable agent, with tasks for
> non-Partition Aware frameworks, attempts to re-register with the master.
> Before the support for partition-aware frameworks, which was introduced in
> Mesos 1.1.0 MESOS-5344,
> if an agent partitioned from the master attempted to re-register, then it
> will be shut down and all the tasks on the agent would be terminated. With
> this feature, the partitioned agents were no longer shut down by the master
> when they re-registered but to keep the old behavior the tasks on these
> agents were still shutdown if the corresponding framework didn’t opt-in to
> partition awareness.
>
> One of the possible solutions to address the issue mentioned in MESOS-7215
> is to change master’s
> behavior to not kill the tasks for non-Partition aware frameworks when an
> unreachable agent re-registers with the master. When an agent goes
> unreachable i.e. fails the masters health check ping for
> max_agent_ping_timeouts then the master sends TASK_LOST status updates for
> all the tasks on this agent which have been launched by non-Partition Aware
> frameworks. So, if such tasks are no longer killed by the master then upon
> agent re-registration the frameworks will see a non-terminal status updates
> for tasks for which they already received a TASK_LOST.
> This change will hopefully not break any schedulers since it could have
> happened in the past with non-strict registry as well and schedulers are
> expected to be resilient enough to handle this scenario.
>
> For the proposed solution we wanted to get feedback from the community to
> ensure that this change doesn’t break or cause any side effects for the
> schedulers. Looking forward to any feedbacks/comments.
>
> Many Thanks
> Megha
>
>
>


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-06-01 Thread Benjamin Mahler
Not a blocker, but I noticed that the parallel test runner isn't bundled in the
release, which shows up if you configure with '--enable-parallel-test-execution':

/Users/bmahler/Downloads/mesos-1.3.0/support/mesos-gtest-runner.py --sequential=*ROOT_* ./stout-tests
/bin/sh: /Users/bmahler/Downloads/mesos-1.3.0/support/mesos-gtest-runner.py: No such file or directory

On Wed, May 31, 2017 at 1:48 PM, Vinod Kone  wrote:

> Thanks for the triage.
>
> +1 (binding)
>
> On Wed, May 31, 2017 at 1:33 PM, Neil Conway 
> wrote:
>
>> On Tue, May 30, 2017 at 3:43 PM, Neil Conway 
>> wrote:
>> > Attached is the test log for this failure. From a quick look, seems as
>> > though the agent starts to launch the task, including forking the
>> > child process, but no subsequent task status updates or error messages
>> > are observed. Gaston, have you seen this before?
>> >
>> > I filed https://issues.apache.org/jira/browse/MESOS-7589 to track this.
>>
>> I wasn't able to repro this failure. Per Gaston's email, there isn't
>> enough information in the logs to understand what is going on here,
>> although it certainly seems weird that apparently the executor doesn't
>> start.
>>
>> I think this doesn't justify blocking the release, but we should watch
>> to see if the problem recurs.
>>
>> Neil
>>
>
>


June 3rd: MesosCon North America CFP due!

2017-06-01 Thread Judith Malnick
Hi Apache Mesos Users and Devs,

This is a friendly reminder that the CFP for MesosCon North America will close:

*Saturday, June 3rd*!

If you've been working on a talk proposal, please put the finishing touches
on it and send it in. The reviewers are really excited to see everyone's
ideas; don't be shy.

If you have any questions feel free to reach out to me (
jmaln...@mesosphere.com) or Kiersten Gaffney (kiers...@mesosphere.io).

All the best!
Judith

-- 
Judith Malnick
DC/OS Community Manager
310-709-1517


Re: Mesos on Windows needs your help

2017-06-01 Thread Artem Harutyunyan
Thank you very much, Li, and the rest of the team at Microsoft for pushing
this through!

We'll make sure to pay attention to the reviewbot, and we would also really
appreciate it if you could vote on release candidates to make sure that, in
case something slips past the CI, we catch it before cutting the release.

Artem.

On Tue, May 30, 2017 at 10:58 AM, Li Li  wrote:

> With the joint effort from Mesosphere and Microsoft, the windows build
> performance *should* be about equal with Posix/Linux now, ~76% tests are
> enabled on the ported windows components, and Mesos container/docker
> container tasks are launched successfully e2e.
>
> We will start helping Mesos windows customers deploy their windows agent
> nodes in their test environments, and then productize these features as our
> next goals. To be able to do that, we need a stable development
> environment.
>
> Recently, there have been multiple regressions on Windows from build
> issues to functionality issues. We have been chasing down these
> regressions, fixing them and trying to push Mesos on Windows features
> forward. However, we all know the situation cannot be sustained well with
> the high frequency of the regressions.
>
> To solve the issues, we’ve enabled two engineering system features for
> Mesos on Windows to prevent regressions before and after each checkin,
>
>    1. *Windows reviewbot has been enabled to verify all of the tests on
>    windows for each PR.* For details, please refer to
>    https://reviews.apache.org/r/59116/.
>
>    2. *Windows build process has been added to CI system.* The build
>    status is posted to #windows channel by the CI bot after committing a
>    PR.
>
> The build regressions are generally caught manually (i.e. git pull &&
> cmake --build .) or when the CI bot posts a failure in the #windows
> channel. For now, these build regressions don't get sent to the
> bui...@mesos.apache.org mailing list due to the flakiness we're seeing in
> the builds@ mailing list.
>
> For developers, if you do not have access to a Windows box, you have two
> options:
> 1. use the Windows Reviewbot.  This runs in a loop (slightly different
> than the Ubuntu Reviewbot) but both reviewbots function the same way.  Just
> push an update to the last review in a chain, and the reviewbot will get
> around to it eventually.
>
> 2. Spin up a Windows box in Azure, AWS or some other cloud with Windows
> Server 2016 + Docker + all the dependencies from
> https://github.com/apache/mesos/blob/master/docs/windows.md.
>
>
> *We highly recommend that everyone analyze the Windows Reviewbot results
> before your checkins and monitor the Windows build status after your checkins.*
>
> The above engineering system effort is just a starting point to prevent
> the regressions. We also need help from our Mesos dev community: when you
> check in a fix, think about whether there are potential regressions on the
> Windows side and verify your fix on Windows as well; when you design a
> feature, feel free to involve us in your discussions and see how these
> features should be designed for Windows, etc.
>
> Only with your help can we deliver Mesos to our Linux customers and
> Windows customers successfully.
>
>
>


[Containerization WG] Notes 06/01/2017

2017-06-01 Thread Jie Yu
Hi

Today, we went over the design proposal (CPU pinning in Mesos) from Dmitry.

Please find today's notes in the following doc.
https://docs.google.com/document/d/1z55a7tLZFoRWVuUxz1FZwgxkHeugtc2nHR89skFXSpU/edit?usp=sharing

The recording can be found here:
https://drive.google.com/file/d/0B78XyMjTpvTmRS1Ya1ZFT3BLQjA/view?usp=sharing

- Jie