Re: Cgroups v2 + Python 3 + Upstreaming X/Twitter Patches

2024-04-03 Thread Benjamin Mahler
Just an update here: in late January we finished upstreaming our internal
patches that were upstreamable. This amounted to 35 patches.

The cgroups v2 work is ongoing; we're hoping to be mostly code complete by
the end of this month.

On Fri, Jan 12, 2024 at 6:01 PM Benjamin Mahler  wrote:

> +user@
>
> On Fri, Jan 12, 2024 at 5:55 PM Benjamin Mahler 
> wrote:
>
>> As part of upgrading to CentOS 9 at X/Twitter, Shatil / Devin (cc'ed)
>> will be working on:
>>
>> * Upgrading to Python 3
>> * Cgroups v2 support
>>
>> We will attempt to upstream this work for the benefit of other users.
>>
>> In addition, we have several long-standing internal patches that should
>> have been upstreamed but weren't. We currently deploy from an internal
>> 1.9.x branch with these additional patches on top of OSS 1.9.x. To simplify
>> the above work since the code changes will be substantial (especially for
>> cgroups v2), we'll try to work off master (or 1.11.x) and upstream our
>> long-standing internal patches to minimize the delta between OSS and our
>> internal branch.
>>
>> Let me know if folks have any questions. IMO it would be good for users
>> to hold off on going into the attic so that we can land this work at least.
>>
>> Ben
>>
>


Re: Add s390x support for Mesos CI

2024-04-01 Thread Benjamin Mahler
You can schedule some time with me to try to add this, since you need an
Apache account to modify these.
The ARM one is separate as Tomek mentioned, so likely there would need to
be a separate one for s390x:

https://ci-builds.apache.org/job/Mesos/job/Mesos-Buildbot-ARM/

It looks like this uses an arm label that matches no nodes, so it hasn't
been running: https://ci-builds.apache.org/label/arm/
I don't see any nodes listed for s390x either?
https://ci-builds.apache.org/label/s390x/

Maybe this UI doesn't work properly, because I was able to find them here,
and they do seem to use the s390x label:
https://jenkins-ccos.apache.org/view/Shared%20-%20s390x%20nodes/

On Fri, Mar 22, 2024 at 2:42 AM Yasir Ashfaq 
wrote:

> It would be great if we can get a separate configuration matrix for s390x.
> We would be able to add new configurations once they start working on
> s390x.
> Could someone who can edit the Jenkins configuration add the configuration
> below for s390x:
>
> --
> Configuration Matrix
>
> OS           | Compiler | Configure flags                                                          | Build tool | Status
> ubuntu:18.04 | gcc      | --verbose --disable-libtool-wrappers --disable-parallel-test-execution  | autotools  | ✅
> ubuntu:18.04 | gcc      | --verbose --disable-libtool-wrappers --disable-parallel-test-execution  | cmake      | ✅
> --
>


Re: Add s390x support for Mesos CI

2024-03-11 Thread Benjamin Mahler
Presumably you want to update the configuration matrix here?

https://ci-builds.apache.org/job/Mesos/job/Mesos-Buildbot/

Not sure where this configuration lives, though.

On Mon, Mar 11, 2024 at 1:28 AM Yasir Ashfaq 
wrote:

> Hi All,
>
> We recently added changes (https://github.com/apache/mesos/pull/449) to
> provide s390x CI support for Mesos.
> Even though the changes are merged, Mesos CI has not been enabled yet.
>
> As per https://issues.apache.org/jira/browse/INFRA-21433, we have
> provided Apache with four s390x nodes; hence, the same nodes can be used to
> enable Apache Mesos CI as well.
>
> Could you please let us know if there is anything further required from
> our side to enable Mesos CI support for s390x?
>
> Thanks & Regards,
> Yasir
>


Re: Cgroups v2 + Python 3 + Upstreaming X/Twitter Patches

2024-01-16 Thread Benjamin Mahler
> > >> <https://sydneyscientific.org> | consultancy <https://offscale.io>
> > >> | open-source <https://github.com/offscale> | LinkedIn
> > >> <https://linkedin.com/in/samuelmarks>
> > >>
> > >>
> > >> On Fri, Jan 12, 2024 at 6:01 PM Benjamin Mahler 
> > >> wrote:
> > >>
> > >>> +user@
> > >>>
> > >>> On Fri, Jan 12, 2024 at 5:55 PM Benjamin Mahler 
> > >>> wrote:
> > >>>
> > >>> > As part of upgrading to CentOS 9 at X/Twitter, Shatil / Devin
> (cc'ed)
> > >>> will
> > >>> > be working on:
> > >>> >
> > >>> > * Upgrading to Python 3
> > >>> > * Cgroups v2 support
> > >>> >
> > >>> > We will attempt to upstream this work for the benefit of other
> users.
> > >>> >
> > >>> > In addition, we have several long-standing internal patches that
> > should
> > >>> > have been upstreamed but weren't. We currently deploy from an
> > internal
> > >>> > 1.9.x branch with these additional patches on top of OSS 1.9.x. To
> > >>> simplify
> > >>> > the above work since the code changes will be substantial
> (especially
> > >>> for
> > >>> > cgroups v2), we'll try to work off master (or 1.11.x) and upstream
> > our
> > >>> > long-standing internal patches to minimize the delta between OSS
> and
> > >>> our
> > >>> > internal branch.
> > >>> >
> > >>> > Let me know if folks have any questions. IMO it would be good for
> > >>> users to
> > >>> > hold off on going into the attic so that we can land this work at
> > >>> least.
> > >>> >
> > >>> > Ben
> > >>> >
> > >>>
> > >>
> >
>


Re: Cgroups v2 + Python 3 + Upstreaming X/Twitter Patches

2024-01-12 Thread Benjamin Mahler
+user@

On Fri, Jan 12, 2024 at 5:55 PM Benjamin Mahler  wrote:

> As part of upgrading to CentOS 9 at X/Twitter, Shatil / Devin (cc'ed) will
> be working on:
>
> * Upgrading to Python 3
> * Cgroups v2 support
>
> We will attempt to upstream this work for the benefit of other users.
>
> In addition, we have several long-standing internal patches that should
> have been upstreamed but weren't. We currently deploy from an internal
> 1.9.x branch with these additional patches on top of OSS 1.9.x. To simplify
> the above work since the code changes will be substantial (especially for
> cgroups v2), we'll try to work off master (or 1.11.x) and upstream our
> long-standing internal patches to minimize the delta between OSS and our
> internal branch.
>
> Let me know if folks have any questions. IMO it would be good for users to
> hold off on going into the attic so that we can land this work at least.
>
> Ben
>


Cgroups v2 + Python 3 + Upstreaming X/Twitter Patches

2024-01-12 Thread Benjamin Mahler
As part of upgrading to CentOS 9 at X/Twitter, Shatil / Devin (cc'ed) will
be working on:

* Upgrading to Python 3
* Cgroups v2 support

We will attempt to upstream this work for the benefit of other users.

In addition, we have several long-standing internal patches that should
have been upstreamed but weren't. We currently deploy from an internal
1.9.x branch with these additional patches on top of OSS 1.9.x. To simplify
the above work since the code changes will be substantial (especially for
cgroups v2), we'll try to work off master (or 1.11.x) and upstream our
long-standing internal patches to minimize the delta between OSS and our
internal branch.

Let me know if folks have any questions. IMO it would be good for users to
hold off on going into the attic so that we can land this work at least.

Ben


Re: Next steps for Mesos

2023-03-20 Thread Benjamin Mahler
Also, if you are still a user of Mesos, please chime in.
Qian, it might be worth having a more explicit email asking users to chime
in, as this email was tailored more for contributors.

Twitter is still using Mesos heavily; we upgraded from a branch based off
of 1.2.x to 1.9.x in 2021, but haven't upgraded to 1.11.x yet. We do have a
lot of patches carried on our branch that have not been upstreamed. I would
like to upstream them to avoid relying on many custom patches and to
get closer to HEAD, but it will take time and quite a bit of work, and it's
not a priority at the moment.

On the contribution side, at this point if I were to continue contributing,
it would be on a volunteer basis, and I can't guarantee having enough time
to do so.

On Fri, Mar 17, 2023 at 9:57 PM Qian Zhang  wrote:

> Hi all,
>
> I'd like to restart the discussion around the future of the Mesos project.
> As you may already be aware, the Mesos community has been inactive for the
> last few years; there were only 3 contributors last year, which is obviously
> not enough to keep the project moving forward. I think we need at least 3
> active committers/PMC members and some active contributors to keep the
> project alive, or we may have to move it to the attic
> <https://attic.apache.org/>.
>
> Call for action: If you are the current committer/PMC member and still
> have the capacity to maintain the project, or if you are willing to
> actively contribute to the project as a contributor, please reply to this
> email, thanks!
>
>
> Regards,
> Qian Zhang
>


Re: Apache Mesos Twitter Account

2022-08-21 Thread Benjamin Mahler
Yes, I still have access. I was mostly managing it from 2016 to the present,
but haven't done anything since late 2020. Do you have anything in
mind you'd like to see tweeted or retweeted?

On Thu, Aug 18, 2022 at 5:05 AM Andreas Peters 
wrote:

> Hi,
>
> sorry for this non-technical question. I already tried it unsuccessfully
> in other ways. :-(
>
> Does someone have access to the Twitter account of Mesos? The last tweet
> is already decades ;-) ago. Would be good if we can reactivate it.
>
>
> Regards,
>
> Andreas
>
>


Re: [VOTE] Move Apache Mesos to Attic

2021-04-06 Thread Benjamin Mahler
+1 (binding)

Thanks to all who contributed to the project.

On Mon, Apr 5, 2021 at 1:58 PM Vinod Kone  wrote:

> Hi folks,
>
> Based on the recent conversations
> <
> https://lists.apache.org/thread.html/raed89cc5ab78531c48f56aa1989e1e7eb05f89a6941e38e9bc8803ff%40%3Cuser.mesos.apache.org%3E
> >
> on our mailing list, it seems to me that the majority consensus among the
> existing PMC is to move the project to the attic <
> https://attic.apache.org/>
> and let the interested community members collaborate on a fork in Github.
>
> I would like to call a vote to dissolve the PMC and move the project to the
> attic.
>
> Please reply to this thread with your vote. Only binding votes from
> PMC/committers count towards the final tally but everyone in the community
> is encouraged to vote. See process here
> .
>
> Thanks,
>


Re: Problems building HEAD of mesos 1.10.x branch

2020-12-07 Thread Benjamin Mahler
Hi Chris, I have some patches that fix command execution on Windows, as
there are functional issues (not only compilation issues). However, we don't
have the resources to actively develop and support Windows at the current
time.

If you're interested I can try to dig up my branch and send you the code.

On Mon, Dec 7, 2020 at 9:44 AM Chris Holman 
wrote:

> I managed to fix this with a preprocessor conditional in
> mesos:\3rdparty\stout\include\stout\os\exec.hpp
> My C++ is extremely rusty, so not sure if this is the right fix to use,
> but it seems to work for me.
>
> $ git diff 3rdparty/stout/include/stout/os/exec.hpp
> --- BEGIN  GIT DIFF ---
> diff --git a/3rdparty/stout/include/stout/os/exec.hpp
> b/3rdparty/stout/include/stout/os/exec.hpp
> index 686e7a152..ea1b4b3fc 100644
> --- a/3rdparty/stout/include/stout/os/exec.hpp
> +++ b/3rdparty/stout/include/stout/os/exec.hpp
> @@ -75,7 +75,16 @@ inline Option<pid_t> spawn(
>  // TODO(bmahler): Probably we shouldn't provide this windows
>  // emulation and should instead have the caller use windows
>  // subprocess functions directly?
> +#ifdef __WINDOWS__
> +inline int execvp(
> +    const std::string& file,
> +    const std::vector<std::string>& argv);
> +#else
>  inline int execvp(const char* file, char* const argv[]);
> +#endif // __WINDOWS__
> +
> +
> +
>
>
>  // This function is a portable version of execvpe ('p' means searching
> @@ -108,7 +117,15 @@ inline int execvp(const char* file, char* const
> argv[]);
>  // TODO(bmahler): Probably we shouldn't provide this windows
>  // emulation and should instead have the caller use windows
>  // subprocess functions directly?
> +#ifdef __WINDOWS__
> +inline int execvpe(
> +    const std::string& file,
> +    const std::vector<std::string>& argv,
> +    const std::map<std::string, std::string>& envp);
> +#else
>  inline int execvpe(const char* file, char** argv, char** envp);
> +#endif // __WINDOWS__
> +
>
>  } // namespace os {
>
> --- END  GIT DIFF ---
>
> -Original Message-
> From: Chris Holman 
> Sent: 01 December 2020 18:42
> To: dev@mesos.apache.org
> Subject: Problems building HEAD of mesos 1.10.x branch
>
> Hi,
>
> I'm trying to build Windows Mesos Agent from the 1.10.x branch of the
> https://github.com/apache/mesos.git repo.
>
> My environment is as follows:
>
>   *   Windows Server 2019
>   *   Visual Studio 2017
>
> cd c:\temp
> git clone https://github.com/apache/mesos.git
> mkdir mesos\build
> cd mesos\build
> # Generate build scripts
> cmake .. -G "Visual Studio 15 2017" -A "x64"
> -DPATCHEXE_PATH="C:\ProgramData\chocolatey\lib\patch\tools\bin" -T host=x64
> # Build mesos agent
> cmake --build . --target mesos-agent --config Release
>
> This is the version I started from:
> C:\temp\mesos>git log -n 1
> commit 2c9e829067465f10bab22aba273e1e3a93f60770 (HEAD -> 1.10.x,
> origin/1.10.x)
> Author: Benjamin Mahler <bmah...@apache.org>
> Date:   Mon Oct 26 16:58:35 2020 -0400
>
> Updated Postgres URL in CentOS 6 Dockerfile.
>
> The link was pointing to an rpm package that has since been
> replaced on the upstream server.
>
> I encountered a couple of compile and linker errors which I fixed along
> the way, but this last issue has me stumped:
>
> c:\temp\mesos\3rdparty\stout\include\stout\exit.hpp(69): warning C4722:
> '__Exit::~__Exit': destructor never returns, potential memory leak
> [C:\temp\mesos\build\src\launcher\mesos-executor.vcxproj]
> mesos.lib(launch.obj) : error LNK2019: unresolved external symbol "int
> __cdecl os::execvp(char const *,char * const * const)" (?execvp@os
> @@YAHPEBDQEBQEAD@Z) referenced in function "protected: virtual int
> __cdecl mesos::internal::slave::
> MesosContainerizerLaunch::execute(void)" (?execute@MesosContainerizerLaunch
> @slave@internal@mesos@@MEAAHXZ)
> [C:\temp\mesos\build\src\launcher\mesos-executor.vcxproj]
> mesos.lib(launch.obj) : error LNK2019: unresolved external symbol "int
> __cdecl os::execvpe(char const *,char * *,char * *)" (?execvpe@os
> @@YAHPEBDPEAPEAD1@Z) referenced in function "protected: virtual int
> __cdecl mesos::internal::slave::
> MesosContainerizerLaunch::execute(void)" (?execute@MesosContainerizerLaunch
> @slave@internal@mesos@@MEAAHXZ)
> [C:\temp\mesos\build\src\launcher\mesos-executor.vcxproj]
> C:\temp\mesos\build\src\mesos-executor.exe : fatal error LNK1120: 2
> unresolved externals
> [C:\temp\mesos\build\src\launcher\mesos-executor.vcxproj]
>
> I can see that the Windows implementation of os::execvp and os::execvpe is
> different to the Linux implementation:
>
> Windows: C:\temp\mesos\3rdparty\stout\include\stout\os\windows\exec

Re: Design document: constraints-based offer filtering

2020-07-29 Thread Benjamin Mahler
Just to add some color to this, picky scheduling has been a long-standing
issue with the two-level scheduling architecture of Mesos. Given that Mesos
does not have enough information from schedulers to be able to pick offers
that the scheduler wants, it can take a very long time to receive a usable
offer.

In the past couple of years, we shipped some great improvements to
the allocator's performance, as well as to prevent offer starvation. Despite
this, it's still the case that if you have picky apps (e.g. constraints
limit the app to only run on 1 agent in the cluster), it can take a very
long time to finally receive the right offer. The adoption of quota has
exacerbated the issue, because it limits the amount of offers you can
receive concurrently (in the pathological case where you have a small
amount of quota consumption left, you can only receive 1 offer at a time).

We've previously stated that we would solve this by implementing a new
"optimistic" offer model that employs optimistic concurrency control
(providing an equivalent to what was explained in Google's omega paper).
However, we found that adding constraints-based offer filtering will solve
the picky scheduling issue for the use cases we've seen, and is much easier
to implement for both mesos and schedulers. Mostly what we see is
schedulers having very picky apps based on some form of constraint (e.g.
marathon's constraint language, or the presence of a specific resource
reservation).
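
To make that concrete, here is a minimal sketch of attribute-based offer
filtering. The types are hypothetical and this is not the allocator code or
the proposed protobuf API; it only illustrates the idea that, with constraints
known up front, the allocator can skip agents that can never satisfy a
framework's constraints instead of offering them and waiting for a decline.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical agent attributes, e.g. {"rack": "r1", "gpu": "true"}.
using Attributes = std::map<std::string, std::string>;

// Hypothetical equality constraint on a single attribute.
struct AttributeEquals {
  std::string name;
  std::string value;
};

// True if the agent's attributes satisfy every constraint.
bool matches(
    const Attributes& agent,
    const std::vector<AttributeEquals>& constraints) {
  for (const AttributeEquals& c : constraints) {
    auto it = agent.find(c.name);
    if (it == agent.end() || it->second != c.value) {
      return false;
    }
  }
  return true;
}

int main() {
  std::vector<std::pair<std::string, Attributes>> agents = {
    {"agent-1", {{"rack", "r1"}, {"gpu", "true"}}},
    {"agent-2", {{"rack", "r2"}}},
  };

  // A "picky" framework that can only run on rack r1 with GPUs.
  std::vector<AttributeEquals> constraints = {{"rack", "r1"}, {"gpu", "true"}};

  for (const auto& [id, attributes] : agents) {
    std::cout << id << ": "
              << (matches(attributes, constraints) ? "offer" : "skip")
              << std::endl;
  }
  return 0;
}

In this sketch agent-2 is never offered to the picky framework, which is the
scheduling-latency win described above.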

So, this will be a huge improvement in the next Mesos release for any users
that use reservations or scheduling constraints to limit where their apps
are deployed.

Please let us know if you have any questions! And thanks Andrei for the
hard work on fleshing out the design details.

Ben

On Tue, Jul 28, 2020 at 9:21 AM Andrei Sekretenko 
wrote:

> Hi all,
> Recently, my colleagues and I have been designing a mechanism in Mesos that
> will allow a framework to put constraints on the contents of the offers it
> receives: on the attributes of offered agents, and, as a next step, on
> resources in the offers, so that the framework is more likely to receive an
> offer it really needs.
> The primary aim of this design is to help "picky" frameworks running in
> presence of quota reduce scheduling latency.
>
> I've distilled the implementation proposal on the Mesos side into a design
> doc draft:
>
> https://docs.google.com/document/d/1MV048BwjLSoa8sn_5hs4kIH4YJMf6-Gsqbij3YuT1No/edit#heading=h.wq9atl6k4yq0
>
> --
> Best regards,
> Andrei Sekretenko
>


Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-06-12 Thread Benjamin Mahler
Ah yes, I forgot: the other piece is network membership for the replicated
log, through our zookeeper::Group-related code. Is that what you're
referring to?

We could put that behind a module interface as well.

On Fri, Jun 12, 2020 at 9:10 PM Benjamin Mahler  wrote:

> > Apache ZooKeeper is used for a number of different things in Mesos, with
> > only leader election being customisable with modules. Your existing
> modular
> > functionality is insufficient for decoupling from Apache ZooKeeper.
>
> Can you clarify which other functionality you're referring to? Mesos only
> relies on ZK for leader election and detection. We do have some libraries
> available in the code for storing the registry in ZK but we do not support
> that currently.
>
> On Thu, Jun 11, 2020 at 11:02 PM Samuel Marks  wrote:
>
>> Apache ZooKeeper is used for a number of different things in Mesos, with
>> only leader election being customisable with modules. Your existing
>> modular
>> functionality is insufficient for decoupling from Apache ZooKeeper.
>>
>> We are ready and waiting to develop here.
>>
>> As mentioned over our off-mailing-list communiqué:
>>
>> The main advantages—and reasoning—for my investment into Mesos has been
>> [the prospect of]:
>>
>>- Making it performant and low-resource utilising on a very small
>> number
>>of nodes… potentially even down to 1 node so that it can 'compete' with
>>Docker Compose.
>>- Reducing the number of distributed systems that all do the same thing
>>in a datacentre environment.
>>   - Postgres has its own consensus, Docker—e.g, via Kubernetes or
>>   Compose—has its own consensus, ZooKeeper has its own consensus,
>> other
>>   things like distributed filesystems… they too; have their own
>> consensus.
>>- The big sell from that first point is actually showing people how to
>>run Mesos and use it for their regular day-to-day development, e.g.:
>>1. Context switching when the one engineer is on multiple projects
>>   2. …then use the same technology at scale.
>>- The big sell from that second point is to reduce the network traffic,
>>speed up each systems consensus—through all using the one system—and
>>simplify analytics.
>>
>>This would be a big deal for your bigger clients, who can easily
>>quantify what this network traffic costs, and what a reduction in
>> network
>>traffic with a corresponding increase in speed would mean.
>>
>>Eventually this will mean that Ops people can tradeoff guarantees for
>>speed (and vice-versa).
>>- Supporting ZooKeeper, Consul, and etcd is just the start.
>>- Supporting Mesos is just the start.
>>- We plan on adding more consensus-guaranteeing systems—maybe even our
>>own Paxos and Raft—and adding this to systems in the Mesos ecosystem
>> like
>>Chronos, Marathon, and Aurora.
>>It is my understanding that a big part of Mesosphere's rebranding is
>>Kubernetes related.
>>
>> Recently—well, just before COVID19!—I spoke at the Sydney Kubernetes
>> Meetup
>> at Google. They too—including Google—were excited by the prospect of
>> removing etcd as a hard-dependency, and supporting all the different ones
>> liboffkv supports.
>>
>> I have the budget, team, and expertise at the ready to invest and
>> contribute these changes. If there are certain design patterns and
>> refactors you want us to commit to along the way, just say the word.
>>
>> Excitedly yours,
>>
>> Samuel Marks
>> Charity <https://sydneyscientific.org> | consultancy <https://offscale.io
>> >
>> | open-source <https://github.com/offscale> | LinkedIn
>> <https://linkedin.com/in/samuelmarks>
>>
>>
>> On Wed, Jun 10, 2020 at 1:42 AM Benjamin Mahler 
>> wrote:
>>
>> > AndreiS just reminded me that we have module interfaces for the master
>> > detector and contender:
>> >
>> >
>> >
>> https://github.com/apache/mesos/blob/1.9.0/include/mesos/module/detector.hpp
>> >
>> >
>> https://github.com/apache/mesos/blob/1.9.0/include/mesos/module/contender.hpp
>> >
>> >
>> >
>> https://github.com/apache/mesos/blob/1.9.0/include/mesos/master/detector.hpp
>> >
>> >
>> https://github.com/apache/mesos/blob/1.9.0/include/mesos/master/contender.hpp
>> >
>> > These should allow you to implement the integration with your library,
>> we
>> > may need to adjust the 

Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-06-12 Thread Benjamin Mahler
> Apache ZooKeeper is used for a number of different things in Mesos, with
> only leader election being customisable with modules. Your existing
modular
> functionality is insufficient for decoupling from Apache ZooKeeper.

Can you clarify which other functionality you're referring to? Mesos only
relies on ZK for leader election and detection. We do have some libraries
available in the code for storing the registry in ZK but we do not support
that currently.

On Thu, Jun 11, 2020 at 11:02 PM Samuel Marks  wrote:

> Apache ZooKeeper is used for a number of different things in Mesos, with
> only leader election being customisable with modules. Your existing modular
> functionality is insufficient for decoupling from Apache ZooKeeper.
>
> We are ready and waiting to develop here.
>
> As mentioned over our off-mailing-list communiqué:
>
> The main advantages—and reasoning—for my investment into Mesos has been
> [the prospect of]:
>
>- Making it performant and low-resource utilising on a very small number
>of nodes… potentially even down to 1 node so that it can 'compete' with
>Docker Compose.
>- Reducing the number of distributed systems that all do the same thing
>in a datacentre environment.
>   - Postgres has its own consensus, Docker—e.g, via Kubernetes or
>   Compose—has its own consensus, ZooKeeper has its own consensus, other
>   things like distributed filesystems… they too; have their own
> consensus.
>- The big sell from that first point is actually showing people how to
>run Mesos and use it for their regular day-to-day development, e.g.:
>1. Context switching when the one engineer is on multiple projects
>   2. …then use the same technology at scale.
>- The big sell from that second point is to reduce the network traffic,
>speed up each systems consensus—through all using the one system—and
>simplify analytics.
>
>This would be a big deal for your bigger clients, who can easily
>quantify what this network traffic costs, and what a reduction in
> network
>traffic with a corresponding increase in speed would mean.
>
>Eventually this will mean that Ops people can tradeoff guarantees for
>speed (and vice-versa).
>- Supporting ZooKeeper, Consul, and etcd is just the start.
>- Supporting Mesos is just the start.
>- We plan on adding more consensus-guaranteeing systems—maybe even our
>own Paxos and Raft—and adding this to systems in the Mesos ecosystem
> like
>Chronos, Marathon, and Aurora.
>It is my understanding that a big part of Mesosphere's rebranding is
>Kubernetes related.
>
> Recently—well, just before COVID19!—I spoke at the Sydney Kubernetes Meetup
> at Google. They too—including Google—were excited by the prospect of
> removing etcd as a hard-dependency, and supporting all the different ones
> liboffkv supports.
>
> I have the budget, team, and expertise at the ready to invest and
> contribute these changes. If there are certain design patterns and
> refactors you want us to commit to along the way, just say the word.
>
> Excitedly yours,
>
> Samuel Marks
> Charity <https://sydneyscientific.org> | consultancy <https://offscale.io>
> | open-source <https://github.com/offscale> | LinkedIn
> <https://linkedin.com/in/samuelmarks>
>
>
> On Wed, Jun 10, 2020 at 1:42 AM Benjamin Mahler 
> wrote:
>
> > AndreiS just reminded me that we have module interfaces for the master
> > detector and contender:
> >
> >
> >
> https://github.com/apache/mesos/blob/1.9.0/include/mesos/module/detector.hpp
> >
> >
> https://github.com/apache/mesos/blob/1.9.0/include/mesos/module/contender.hpp
> >
> >
> >
> https://github.com/apache/mesos/blob/1.9.0/include/mesos/master/detector.hpp
> >
> >
> https://github.com/apache/mesos/blob/1.9.0/include/mesos/master/contender.hpp
> >
> > These should allow you to implement the integration with your library, we
> > may need to adjust the interfaces a little, but this will let you get
> what
> > you need done without the burden on us to shepherd the work.
> >
> > On Fri, May 22, 2020 at 8:38 PM Samuel Marks  wrote:
> >
> > > Following on from the discussion on GitHub and here on the
> mailing-list,
> > > here is the proposal from me and my team:
> > > --
> > >
> > > Choice of approach
> > >
> > > The “mediator” of every interaction with ZooKeeper in Mesos is the
> > > ZooKeeper
> > > class, declared in include/mesos/zookeeper/zookeeper.hpp.
> > >
> > > Of note are the following two differences in the 

Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-06-09 Thread Benjamin Mahler
ag is provided.
>
> However to avoid polluting the code, we are recommending the second
> approach.
> Incompatibilities
>
> The following is the list of incompatibilities between the interfaces of
> ZooKeeper class and liboffkv. Some of those features should be implemented
> in liboffkv; others should be emulated inside the ZooKeeper/KvClient class;
> and for others still, the change of the interface of ZooKeeper/KvClient is
> the preferred solution.
>
>-
>
>Asynchronous session establishment. We propose to emulate this through
>spawning a new thread in the constructor of ZooKeeper/KvClient.
>-
>
>Push-style watch notification API. We propose to emulate this through
>spawning a new thread for each watch; such a thread would then do the
> wait
>and then invoke watcher->process() under a mutex. The number of threads
>should not be a concern here, as the only user that uses watches at all
> (
>GroupProcess) only registers at most one watch.
>-
>
>Multiple servers in URL string. We propose to implement this in
> liboffkv.
>-
>
>Authentication. We propose to implement this in liboffkv.
>-
>
>ACLs (access control lists). The following ACLs are in fact used for
>everything:
>
>_auth.isSome()
>? zookeeper::EVERYONE_READ_CREATOR_ALL
>: ZOO_OPEN_ACL_UNSAFE
>
>We thus propose to:
>1.
>
>   implement rudimentary support for ACLs in liboffkv in the form of an
>   optional parameter to create(),
>
>   bool protect_modify = false
>
>   2.
>
>   change the interface of ZooKeeper/KvClient so that protect_modify
>   flag is used instead of ACLs.
>   -
>
>Configurable session timeout. We propose to implement this in liboffkv.
>-
>
>Getting the actual session timeout, which might differ from the
>user-provided as a result of timeout negotiation with server. We
> propose to
>implement this in liboffkv.
>-
>
>Getting the session ID. We propose to implement this in liboffkv, with
>session ID being std::string; and to modify the interface accordingly.
>It is possible to hash a string into a 64-bit number, but in the
>circumstances given, we think it is just not worth it.
>-
>
>Getting the status of the connection to the server. We propose to
>implement this in liboffkv.
>-
>
>Sequenced nodes. We propose to emulate this in the class. Here is the
>pseudo-code of our solution:
>
>while (true) {
>[counter, version] = get("/counter")
>seqnum = counter + 1
>name = "label" + seqnum
>try {
>commit {
>check "/counter" version,
>set "/counter" seqnum,
>create name value
>}
>break
>} catch (TxnAborted) {}
>}
>
>-
>
>“Recursive” creation of each parent in create(), akin to mkdir -p. This
>is already emulated in the class, as ZooKeeper does not natively support
>it; we propose to extend this emulation to work with liboffkv.
>-
>
>The semantics of the “set” operation if the entry does not exist:
>ZooKeeper fails with ZNONODE in this case, while liboffkv creates a new
>node. We propose to emulate this in-class with a transaction.
>-
>
>The semantics of the “erase” operation: ZooKeeper fails with ZNOTEMPTY
>if node has children, while liboffkv removes the subtree recursively. As
>neither of users ever attempts to remove node with children, we propose
> to
>change the interface so that it declares (and actually implements) the
>liboffkv-compatible semantics.
>-
>
>Return of ZooKeeper-specific Stat structures instead of just versions.
>As both users only use the version field of this structure, we propose
> to
>simply alter the interface so that only the version is returned.
>-
>
>Explicit “session drop” operation that also immediately erases all the
>“leased” nodes. We propose to implement this in liboffkv.
>-
>
>Check if the node being created has leased parent. Currently, liboffkv
>declares this to be unspecified behavior: it may either throw (if
> ZooKeeper
>is used as the back-end) or successfully create the node (otherwise). As
>neither of users ever attempts to create such a node, we propose to
> leave
>this as is.
>
> Estimates
> We estimate that—including tests—this will be ready by the end of next
> month.
> --
>
> Open to alternative suggestions, otherwise

Re: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

2020-05-27 Thread Benjamin Mahler
+1 (binding)

On Mon, May 18, 2020 at 4:36 PM Andrei Sekretenko 
wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.10.0.
>
> 1.10.0 includes the following major improvements:
>
> 
> * support for resource bursting (setting task resource limits separately
> from requests) on Linux
> * ability for an executor to communicate with an agent via Unix domain
> socket instead of TCP
> * ability for operators to modify reservations via the RESERVE_RESOURCES
> master API call
> * performance improvements of V1 operator API read-only calls bringing
> them on par with V0 HTTP endpoints
> * ability for a scheduler to expect that effects of calls sent through the
> same connection will not be reordered/interleaved by master
>
> NOTE: 1.10.0 includes a breaking change for custom authorizer modules.
> Now, `ObjectApprover`s may be stored by Mesos indefinitely and must be
> kept up-to-date by an authorizer throughout their lifetime.
> This allowed for several bugfixes and performance improvements.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.10.0-rc1
>
> 
>
> The candidate for Mesos 1.10.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz
>
> The tag to be voted on is 1.10.0-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.10.0-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1259
>
> Please vote on releasing this package as Apache Mesos 1.10.0!
>
> The vote is open until Fri, May 21, 19:00 CEST  and passes if a majority
> of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.10.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Andrei Sekretenko
>


Re: [VOTE] Release Apache Mesos 1.7.3 (rc1)

2020-05-07 Thread Benjamin Mahler
+1 (binding)

On Mon, May 4, 2020 at 1:48 PM Greg Mann  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.3.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.3-rc1
>
> 
>
> The candidate for Mesos 1.7.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz
>
> The tag to be voted on is 1.7.3-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.3-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1258
>
> Please vote on releasing this package as Apache Mesos 1.7.3!
>
> The vote is open until Thu, May 7, 11:00 PDT 2020, and passes if a majority
> of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Greg Mann
>


Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-05-01 Thread Benjamin Mahler
So it sounds like:

ZooKeeper: The official C library has an async API. Are we gaining a lot with
the third-party C++ wrapper you pointed to? Maybe it "just works", but it
looks very inactive and it's hard to tell how maintained it is.

Consul: No official C or C++ library. Only some third-party C++ ones that
look pretty inactive. The ppconsul one you linked to does have an issue
about an async API; I commented on it:
https://github.com/oliora/ppconsul/issues/26.

etcd: Can use the gRPC C++ client's async API.

Since 2 of 3 provide an async API already, I would lean more towards an
async API so that we don't have to change anything in the Mesos code when
the last one gets an async implementation. However, we currently use the
synchronous ZK API, so I realize this would be more work, since we'd first
have to adjust the Mesos code to use the async ZooKeeper API. I agree that a
synchronous interface is simpler to start with, since that will be an easier
integration and we currently do not perform many concurrent operations (and
probably won't anytime soon).
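
To illustrate the difference, here is a minimal sketch of presenting an
asynchronous interface on top of a synchronous client by pushing the blocking
call onto another thread. SyncKVClient is hypothetical and stands in for any
blocking backend library; this is not liboffkv's or Mesos' actual code.

#include <future>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Hypothetical synchronous client, standing in for a blocking KV library.
struct SyncKVClient {
  std::map<std::string, std::string> data{{"/mesos/leader", "master-1"}};

  std::string get(const std::string& key) {
    // Imagine a blocking network round trip here.
    return data.at(key);
  }
};

// Asynchronous facade: each call runs the blocking operation on its own
// thread via std::async and hands the caller a future instead of blocking.
class AsyncKVClient {
public:
  explicit AsyncKVClient(SyncKVClient client) : client_(std::move(client)) {}

  std::future<std::string> get(const std::string& key) {
    return std::async(
        std::launch::async,
        [this, key]() { return client_.get(key); });
  }

private:
  SyncKVClient client_;
};

int main() {
  AsyncKVClient client{SyncKVClient{}};
  std::future<std::string> leader = client.get("/mesos/leader");
  // The caller can do other work here and block only when the value is needed.
  std::cout << "leader: " << leader.get() << std::endl;
  return 0;
}

This is essentially the thread-per-call emulation described in the quoted
message below; a "fair" async client would instead plumb the backend's own
async API (the ZK C bindings, gRPC's async interface) through without extra
threads.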

On Sun, Apr 26, 2020 at 11:17 PM Samuel Marks  wrote:

> In terms of asynchronous vs synchronous interfacing, when we started
> liboffkv, it had an asynchronous interface. Then we decided to drop it and
> implemented a synchronous one, due to the dependent libraries which
> liboffkv uses under the hood.
>
> Our ZooKeeper implementation uses the zookeeper-cpp library
> <https://github.com/tgockel/zookeeper-cpp>—a well-maintained C++ wrapper
> around common Zookeeper C bindings [which we contributed to vcpkg
> <https://github.com/microsoft/vcpkg/pull/7001>]. It has an asynchronous
> interface based on std::future
> <https://en.cppreference.com/w/cpp/thread/future>. Since std::future does
> not provide chaining or any callbacks, a Zookeeper-specific result cannot
> be asynchronously mapped to liboffkv result. In early versions of liboffkv
> we used thread pool to do the mapping.
>
> Consul implementation is based on the ppconsul
> <https://github.com/oliora/ppconsul> library [which we contributed to
> vcpkg
> <
> https://github.com/microsoft/vcpkg/pulls?q=is%3Apr+author%3ASamuelMarks+ppconsul
> >],
> which in turn utilizes libcurl <https://curl.haxx.se/libcurl>.
> Unfortunately, ppconsul uses libcurl's easy interface, and consequently it
> is synchronous by design. Again, in the early version of the library we
> used a thread pool to overcome this limitation.
>
> As for etcd, we autogenerated the gRPC C++ client
> <https://github.com/offscale/etcd-client-cpp> [which we contributed to
> vcpkg
> <https://github.com/microsoft/vcpkg/pull/6999>]. gRPC provides an
> asynchronous interface, so a "fair" async client can be implemented on top
> of it.
>
> To sum up, with the chosen toolkit, two of the three implementations require
> a thread pool. After careful consideration, we have preferred to give the
> user control over threading and opted out of the asynchrony.
>
> Nevertheless, there are some options. zookeeper-cpp allows building with
> custom futures/promises, so we can create a custom build to use in
> liboffkv/Mesos. Another variant is to use plain C ZK bindings
> <
> https://gitbox.apache.org/repos/asf?p=zookeeper.git;a=tree;f=zookeeper-client/zookeeper-client-c;h=c72b57355c977366edfe11304067ff35f5cf215d;hb=HEAD
> >
> instead of the C++ library.
> As for the Consul client, the only meaningful option is to opt out of using
> ppconsul and operate through libcurl's multi interface.
>
> At this point implementing asynchronous interfaces will require rewriting
> liboffkv from the ground up. I can allocate the budget for doing this, as I
> have done to date. However, it would be good to have some more
> back-and-forth before reengaging.
>
> Design Doc:
>
> https://docs.google.com/document/d/1NOfyt7NzpMxxatdFs3f9ixKUS81DHHDVEKBbtVfVi_0
> [feel free to add it to
> http://mesos.apache.org/documentation/latest/design-docs/]
>
> Thanks,
>
> *SAMUEL MARKS*
> Sydney Medical School | Westmead Institute for Medical Research |
> https://linkedin.com/in/samuelmarks
> Director | Sydney Scientific Foundation Ltd <https://sydneyscientific.org>
> | Offscale.io of Sydney Scientific Pty Ltd <https://offscale.io>
>
> PS: Damien - not against contributing to FoundationDB, but priorities are
> Mesos and the Mesos ecosystem, followed by Kubernetes and its ecosystem.
>
> On Tue, Apr 21, 2020 at 3:19 AM Benjamin Mahler 
> wrote:
>
> > Samuel: One more thing I forgot to mention, we would prefer to use an
> > asynchronous client interface rather than a synchronous one. Is that
> > something you have considered?
> >
> > On Fri, Apr 17, 2020 at 6:11 PM Vinod Kone  wrote:
> >

Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-04-20 Thread Benjamin Mahler
Samuel: One more thing I forgot to mention, we would prefer to use an
asynchronous client interface rather than a synchronous one. Is that
something you have considered?

On Fri, Apr 17, 2020 at 6:11 PM Vinod Kone  wrote:

> Hi Samuel,
>
> Thanks for showing interest in contributing to the project. Having
> optionality between ZooKeeper and Etcd would be great for the project and
> something that has been brought up a few times before, as you noted.
>
> I echo everything that BenM said. As part of the design it would be great
> to see the migration path for users currently using Mesos with ZooKeeper to
> Etcd. Ideally, the migration can happen without much user intervention.
>
> Additionally, from our past experience, efforts like these are more
> successful if the people writing the code have experience with how things
> work in Mesos code base. So I would recommend starting small, maybe have a
> few engineers work on a couple "newbie" tickets and do some small projects
> and have those committed to the project. That gives the committers some
> level of confidence about quality of the code and be more open to bigger
> changes like etcd integration. It would also help contributors get a better
> feeling for the lay of the land and see if they are truly interested in
> maintaining this piece of integration for the long haul. This is a bit of a
> longer path but I think it would be a more fruitful one.
>
> Looking forward to seeing new contributions to Mesos including the above
> design!
>
> Thanks,
>
> On Fri, Apr 17, 2020 at 4:52 PM Samuel Marks  wrote:
>
> > Happy to build a design doc,
> >
> > To answer your question on what Offscale.io is, it's my software and
> > biomedical engineering consultancy. Currently it's still rather small,
> with
> > only 8 engineers, but I'm expecting & preparing to grow rapidly.
> >
> > My philosophy is always open-source and patent-free, so that's what my
> > consultancy—and for that matter, the charitable research that I fund
> > through it <https://sydneyscientific.org>—follows.
> >
> > The goal of everything we create is: interoperable (cross-platform,
> > cross-technology, cross-language, multi-cloud); open-source (Apache-2.0
> OR
> > MIT); with a view towards scaling:
> >
> >- teams;
> >- software-development <https://compilers.com.au>;
> >- infrastructure [this proposed Mesos contribution + our DevOps
> > tooling];
> >- [in the charity's case] facilitating very large-scale medical
> >diagnostic screening.
> >
> > Technologies like Mesos we expect to both optimise resource
> > allocation—reducing costs and increasing data locality—and award us
> > 'bragging rights' with which we can gain clients that are already using
> > Mesos (which, from my experience, is always big corporates… though
> > hopefully contributions like these will make it attractive to small
> > companies also).
> >
> > So no, we're not going anywhere, and are planning to maintain this
> library
> > into the future
> >
> > PS: Once accepted by Mesos, we'll be making similar contributions to
> other
> > Mesos ecosystem projects like Chronos <https://mesos.github.io/chronos>,
> > Marathon <https://github.com/mesosphere/marathon>, and Aurora
> > <https://github.com/aurora-scheduler/aurora> as well as to unrelated
> > projects (e.g., removing etcd as a hard-dependency from Kubernetes
> > <https://kubernetes.io>… enabling them to choose between ZooKeeper,
> etcd,
> > and Consul).
> >
> > Thanks for your continual feedback,
> >
> > *SAMUEL MARKS*
> > Sydney Medical School | Westmead Institute for Medical Research |
> > https://linkedin.com/in/samuelmarks
> > Director | Sydney Scientific Foundation Ltd <
> https://sydneyscientific.org>
> > | Offscale.io of Sydney Scientific Pty Ltd <https://offscale.io>
> >
> >
> > On Sat, Apr 18, 2020 at 6:58 AM Benjamin Mahler 
> > wrote:
> >
> > > Oh ok, could you tell us a little more about how you're using Mesos?
> And
> > > what offscale.io is?
> > >
> > > Strictly speaking, we don't really need packaging and releases as we
> can
> > > bundle the dependency in our repo and that's what we do for many of our
> > > dependencies.
> > > To me, the most important thing is the commitment to maintain the
> library
> > > and address issues that come up.
> > > I also would lean more towards a run-time flag rather than a build
> level
> > > flag, if possible.
> > >
> > >

Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-04-17 Thread Benjamin Mahler
Oh ok, could you tell us a little more about how you're using Mesos? And
what offscale.io is?

Strictly speaking, we don't really need packaging and releases as we can
bundle the dependency in our repo and that's what we do for many of our
dependencies.
To me, the most important thing is the commitment to maintain the library
and address issues that come up.
I also would lean more towards a run-time flag rather than a build level
flag, if possible.

I think the best place to start would be to put together a design doc. The
act of writing that will force the author to think through the details (and
there are a lot of them!), and we'll then get a chance to give feedback.
You can look through the mailing list for past examples of design docs (in
terms of which sections to include, etc).

How does that sound?

On Tue, Apr 14, 2020 at 8:44 PM Samuel Marks  wrote:

> Dear Benjamin Mahler [and *Developers mailing-list for Apache Mesos*],
>
> Thanks for responding so quickly.
>
> Actually this entire project I invested—time & money, including a
> development team—explicitly in order to contribute this to Apache Mesos. So
> no releases yet, because I wanted to ensure it was up to the specification
> requirements referenced in dev@mesos.apache.org before proceeding with
> packaging and releases.
>
> Tests have been setup in Travis CI for Linux (Ubuntu 18.04) and macOS,
> happy to set them up elsewhere also. There are also some Windows builds
> that need a bit of tweaking, then they will be pushed into CI also. We are
> just starting to do some work on reducing build & test times.
>
> Would be great to build a checklist of things you want to see before we
> send the PR, e.g.,
>
>- ☐ hosted docs;
>- ☐ CI/CD—including packaging—for Windows, Linux, and macOS;
>- ☐ releases on GitHub;
>- ☐ consistent session and auth interface
>- ☐ different tests [can you expand here?]
>
> This is just an example checklist, would be best if you and others can
> flesh it out, so when we do send the PR it's in an immediately mergable
> state.
>
> BTW: Originally had a debate with my team about whether to send a PR out of
> the blue—like Microsoft famously did for Node.js
> <https://github.com/nodejs/node/pull/4765>—or start an *offer thread* on
> the developers mailing-list.
>
> Looking forward to contributing 呂
>
> *SAMUEL MARKS*
> Sydney Medical School | Westmead Institute for Medical Research |
> https://linkedin.com/in/samuelmarks
> Director | Sydney Scientific Foundation Ltd <https://sydneyscientific.org>
> | Offscale.io of Sydney Scientific Pty Ltd <https://offscale.io>
>
>
> On Wed, Apr 15, 2020 at 2:38 AM Benjamin Mahler 
> wrote:
>
> > Thanks for reaching out, a well maintained and well written wrapper
> > interface to the three backends would certainly make this easier for us
> vs
> > implementing such an interface ourselves.
> >
> > Is this the client interface?
> >
> >
> https://github.com/offscale/liboffkv/blob/d31181a1e74c5faa0b7f5d7001879640b4d9f111/liboffkv/client.hpp#L115-L142
> >
> > At a quick glance, three ZK things that we rely on but seem to be absent
> > from the common interface is the ZK session, authentication, and
> > authorization. How will these be provided via the common interface?
> >
> > Here is our ZK interface wrapper if you want to see what kinds of things
> we
> > use:
> >
> >
> https://github.com/apache/mesos/blob/1.9.0/include/mesos/zookeeper/zookeeper.hpp#L72-L339
> >
> > The project has 0 releases and 0 issues, what kind of usage has it seen?
> > Has there been any testing yet? Would Offscale.io be doing some of the
> > testing?
> >
> > On Mon, Apr 13, 2020 at 7:54 PM Samuel Marks  wrote:
> >
> > > Apache ZooKeeper <https://zookeeper.apache.org> is a large dependency.
> > > Enabling developers and operations to use etcd <https://etcd.io>,
> Consul
> > > <https://consul.io>, or ZooKeeper should reduce resource utilisation
> and
> > > enable new use cases.
> > >
> > > There have already been a number of suggestions to get rid of hard
> > > dependency on ZooKeeper. For example, see: MESOS-1806
> > > <https://issues.apache.org/jira/browse/MESOS-1806>, MESOS-3574
> > > <https://issues.apache.org/jira/browse/MESOS-3574>, MESOS-3797
> > > <https://issues.apache.org/jira/browse/MESOS-3797>, MESOS-5828
> > > <https://issues.apache.org/jira/browse/MESOS-5828>, MESOS-5829
> > > <https://issues.apache.org/jira/browse/MESOS-5829>. However, there are
> > > difficulties in supporting a few implementations for d

Re: [AREA1 SUSPICIOUS] [OFFER] Remove ZooKeeper as hard-dependency, support etcd, Consul, OR ZooKeeper

2020-04-14 Thread Benjamin Mahler
Thanks for reaching out, a well-maintained and well-written wrapper
interface to the three backends would certainly make this easier for us than
implementing such an interface ourselves.

Is this the client interface?
https://github.com/offscale/liboffkv/blob/d31181a1e74c5faa0b7f5d7001879640b4d9f111/liboffkv/client.hpp#L115-L142

At a quick glance, three ZK things that we rely on but that seem to be absent
from the common interface are the ZK session, authentication, and
authorization. How will these be provided via the common interface?

Here is our ZK interface wrapper if you want to see what kinds of things we
use:
https://github.com/apache/mesos/blob/1.9.0/include/mesos/zookeeper/zookeeper.hpp#L72-L339
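
For reference, here is a rough sketch of the interface shape being discussed,
with hypothetical names (this is not the actual liboffkv or Mesos ZooKeeper
API); it shows where session handling and authentication would need to
surface in a backend-agnostic client, alongside the usual key-value and watch
operations.

#include <chrono>
#include <functional>
#include <string>
#include <vector>

// Hypothetical backend-agnostic coordination client interface (sketch only).
class CoordinationClient {
public:
  virtual ~CoordinationClient() = default;

  // Session lifecycle: connect with a requested timeout and credentials,
  // observe connection state, and expose the session identity.
  virtual void connect(
      const std::string& servers,
      std::chrono::milliseconds sessionTimeout,
      const std::string& authScheme,      // e.g. "digest" for ZK-style auth
      const std::string& credentials) = 0;
  virtual bool connected() const = 0;
  virtual std::string sessionId() const = 0;

  // Key-value operations common to ZooKeeper, etcd, and Consul KV.
  virtual void create(
      const std::string& key,
      const std::string& value,
      bool ephemeral) = 0;
  virtual std::string get(const std::string& key) = 0;
  virtual std::vector<std::string> children(const std::string& key) = 0;
  virtual void erase(const std::string& key) = 0;

  // Watch: callback invoked when the key (or its children) change.
  virtual void watch(
      const std::string& key,
      std::function<void()> callback) = 0;
};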

The project has 0 releases and 0 issues; what kind of usage has it seen?
Has there been any testing yet? Would Offscale.io be doing some of the
testing?

On Mon, Apr 13, 2020 at 7:54 PM Samuel Marks  wrote:

> Apache ZooKeeper  is a large dependency.
> Enabling developers and operations to use etcd , Consul
> , or ZooKeeper should reduce resource utilisation and
> enable new use cases.
>
> There have already been a number of suggestions to get rid of hard
> dependency on ZooKeeper. For example, see: MESOS-1806
> , MESOS-3574
> , MESOS-3797
> , MESOS-5828
> , MESOS-5829
> . However, there are
> difficulties in supporting a few implementations for different services
> with quite distinct data models.
>
> A few months ago offscale.io invested in a solution to this problem -
> liboffkv  – a *C++* library which
> provides a *uniform interface over ZooKeeper, Consul KV and etcd*. It
> abstracts common features of these services into its own data model which
> is very similar to ZooKeeper’s one. Careful attention was paid to keep
> methods both efficient and consistent. It is cross-platform,
> open-source (*Apache-2.0
> OR MIT*), and is written in C++, with vcpkg packaging, *C library output
>  >*,
> and additional interfaces in *Go *,
> *Java
> *, and *Rust
> *.
>
> Offscale.io proposes to replace all ZooKeeper usages in Mesos with usages
> of liboffkv. Since all interactions which require ZooKeeper in Mesos are
> conducted through the class Group (and GroupProcess) with a clear interface
> the obvious way to introduce changes is to provide another implementation
> of the class which uses liboffkv instead of ZooKeeper. In this case the
> original implementation may be left unchanged in the codebase and build
> flags to select from ZK-only and liboffkv variants may be introduced. Once
> the community is confident, you can decide to remove the ZK-only option,
> and instead only support liboffkv [which internally has build flags for
> each service].
>
> Removing the hard dependency on ZooKeeper will simplify local deployment
> for testing purposes as well as enable using Mesos in clusters without
> ZooKeeper, e.g. where etcd or Consul is used for coordination. We expect
> this to greatly reduce the amount of resource—network, CPU, disk,
> memory—usage in a datacenter environment.
>
> If the community accepts the initiative, we will integrate liboffkv into
> Mesos. We are also ready to develop the library and consider any suggested
> improvements.
> *SAMUEL MARKS*
> Sydney Medical School | Westmead Institute for Medical Research |
> https://linkedin.com/in/samuelmarks
> Director | Sydney Scientific Foundation Ltd 
> | Offscale.io of Sydney Scientific Pty Ltd 
> *SYDNEY SCIENTIFIC FOUNDATION and THE UNIVERSITY OF SYDNEY*
>
> PS: We will be offering similar contributions to Chronos
> , Marathon
> , Aurora
> , and related projects.
>


Welcome Andrei Sekretenko as a new committer and PMC member!

2020-01-21 Thread Benjamin Mahler
Please join me in welcoming Andrei Sekretenko as the newest committer and
PMC member!

Andrei has been active in the project for almost a year at this point and
has been a productive and collaborative member of the community.

He has helped out a lot with allocator work, both with code and
investigations of issues. He made improvements to multi-role framework
scalability (which includes the addition of the UPDATE_FRAMEWORK call), and
exposed metrics for per-role quota consumption.

He has also investigated, identified, and followed up on important bugs.
One such example is the message re-ordering issue he is currently working
on: https://issues.apache.org/jira/browse/MESOS-10023

Thanks for all your work so far Andrei, I'm looking forward to more of your
contributions in the project.

Ben


Re: RFC: Improving linting in Mesos (MESOS-9630)

2020-01-14 Thread Benjamin Mahler
Benjamin figured this out for me. For posterity, I needed to:

$ pyenv install 2.7.17
$ pyenv global 3.7.4 2.7.17

On Tue, Jan 14, 2020 at 2:37 PM Benjamin Mahler  wrote:

>  Have folks been able to set this up successfully on macOS? Is my python
> virtual env screwed up?
>
> ./support/setup-dev.sh
> [INFO] Installing environment for local.
> [INFO] Once installed this environment will be reused.
> [INFO] This may take a few minutes...
> An unexpected error has occurred: CalledProcessError: command:
> ('/usr/local/Cellar/pre-commit/1.21.0_1/libexec/bin/python3.8',
> '-mvirtualenv',
> '/Users/bmahler/.cache/pre-commit/repoxm_szyol/py_env-python2', '-p',
> 'python2')
> return code: 3
> expected return code: 0
> stdout:
> The path python2 (from --python=python2) does not exist
>
> stderr: (none)
> Check the log at /Users/bmahler/.cache/pre-commit/pre-commit.log
>
> On Wed, Sep 18, 2019 at 7:04 AM Benjamin Bannier 
> wrote:
>
>> Hello again,
>>
>> I have landed the patches for MESOS-9630 on the `master` branch, so we now
>> use pre-commit as linting framework.
>>
>> pre-commit primer
>> =
>>
>> 0. Install pre-commit, https://pre-commit.com/#install.
>>
>> 1. Run `./support/setup-dev.sh` to install hooks. We have broken
>> developer-related setup out of `./bootstrap` which by now only bootstraps
>> the autotools project while `support/setup-dev.sh` sets up developer
>> configuration files and git hooks.
>>
>> 2. As git hooks are global to a checkout and not tied to branches, you
>> might run into issues with the linter setup on older branches since
>> configuration files or scripts might not be present. You either should
>> setup that branch's linters with e.g., `./bootstrap`, or could silence
>> warnings from the missing linter setup with e.g.,
>>
>>$ PRE_COMMIT_ALLOW_NO_CONFIG=1 git commit
>>
>> 3. You can use the `SKIP` environment variable to disable certain linters,
>> e.g.,
>>
>># git-revert(1) often produces titles longer than 72 characters.
>>$ SKIP=gitlint git revert HEAD
>>
>>`SKIP` takes a linter `id` which you can look up in
>> `.pre-commit-config.yaml`.
>>
>> 4. We still use git hooks, but To explicitly lint your staged changes
>> before a commit execute
>>
>># Run all applicable linters,
>>$ pre-commit
>>
>># or a certain linter, e.g., `cpplint`.
>>$ pre-commit run cpplint
>>
>>pre-commit runs only on staged changes.
>>
>> 5. To run a full linting of the whole codebase execute
>>
>>$ SKIP=split pre-commit run -a
>>
>>We need to skip the `split` linter as it would complain about a mix of
>> files from stout, libprocess, and Mesos proper (it could be rewritten to
>> lift this preexisting condition).
>>
>> 6. pre-commit caches linter environments in `$XDG_CACHE_HOME/.pre-commit`
>> where `XDG_CACHE_HOME` is most often `$HOME/.cache`. While pre-commit
>> automatically sets up linter environments, cleanup is manual
>>
>># gc unused linter environments, e.g., after linter updates.
>>$ pre-commit gc
>>
>># Remove all cached environments.
>>$ pre-commit clean
>>
>> 7. To make changes to your local linting setup replace the symlink
>> `.pre-commit-config.yaml` with a copy of `support/pre-commit-config.yaml`
>> and adjust as needed. pre-commit maintains a listing of hooks of varying
>> quality, https://pre-commit.com/hooks.html and other linters can be added
>> pretty easily (see e.g., the `local` linters `split`, `license`, and
>> `cpplint` in our setup). Consider upstreaming whatever you found useful.
>>
>>
>>
>> Happy linting,
>>
>> Benjamin
>>
>> On Sat, Aug 17, 2019 at 2:12 PM Benjamin Bannier 
>> wrote:
>>
>> > Hi,
>> >
>> > I opened MESOS-9360[^1] to improve the way we do linting in Mesos some
>> time
>> > ago. I have put some polish on my private setup and now published it,
>> and
>> > am
>> > asking for feedback as linting is an important part of working with
>> Mesos
>> > for
>> > most of you. I have moved my workflow to pre-commit more than 6 months
>> ago
>> > and
>> > prefer it so much that I will not go back to `support/mesos-style.py`.
>> >
>> > * * *
>> >
>> > We use `support/mesos-style.py` to perform linting, most often triggered
>> > automatically when committing. This setup is powe

Re: RFC: Improving linting in Mesos (MESOS-9630)

2020-01-14 Thread Benjamin Mahler
 Have folks been able to set this up successfully on macOS? Is my python
virtual env screwed up?

./support/setup-dev.sh
[INFO] Installing environment for local.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
An unexpected error has occurred: CalledProcessError: command:
('/usr/local/Cellar/pre-commit/1.21.0_1/libexec/bin/python3.8',
'-mvirtualenv',
'/Users/bmahler/.cache/pre-commit/repoxm_szyol/py_env-python2', '-p',
'python2')
return code: 3
expected return code: 0
stdout:
The path python2 (from --python=python2) does not exist

stderr: (none)
Check the log at /Users/bmahler/.cache/pre-commit/pre-commit.log

On Wed, Sep 18, 2019 at 7:04 AM Benjamin Bannier 
wrote:

> Hello again,
>
> I have landed the patches for MESOS-9630 on the `master` branch, so we now
> use pre-commit as linting framework.
>
> pre-commit primer
> =
>
> 0. Install pre-commit, https://pre-commit.com/#install.
>
> 1. Run `./support/setup-dev.sh` to install hooks. We have broken
> developer-related setup out of `./bootstrap` which by now only bootstraps
> the autotools project while `support/setup-dev.sh` sets up developer
> configuration files and git hooks.
>
> 2. As git hooks are global to a checkout and not tied to branches, you
> might run into issues with the linter setup on older branches since
> configuration files or scripts might not be present. You either should
> setup that branch's linters with e.g., `./bootstrap`, or could silence
> warnings from the missing linter setup with e.g.,
>
>$ PRE_COMMIT_ALLOW_NO_CONFIG=1 git commit
>
> 3. You can use the `SKIP` environment variable to disable certain linters,
> e.g.,
>
># git-revert(1) often produces titles longer than 72 characters.
>$ SKIP=gitlint git revert HEAD
>
>`SKIP` takes a linter `id` which you can look up in
> `.pre-commit-config.yaml`.
>
> 4. We still use git hooks, but to explicitly lint your staged changes
> before a commit, execute
>
># Run all applicable linters,
>$ pre-commit
>
># or a certain linter, e.g., `cpplint`.
>$ pre-commit run cpplint
>
>pre-commit runs only on staged changes.
>
> 5. To run a full linting of the whole codebase execute
>
>$ SKIP=split pre-commit run -a
>
>We need to skip the `split` linter as it would complain about a mix of
> files from stout, libprocess, and Mesos proper (it could be rewritten to
> lift this preexisting condition).
>
> 6. pre-commit caches linter environments in `$XDG_CACHE_HOME/.pre-commit`
> where `XDG_CACHE_HOME` is most often `$HOME/.cache`. While pre-commit
> automatically sets up linter environments, cleanup is manual
>
># gc unused linter environments, e.g., after linter updates.
>$ pre-commit gc
>
># Remove all cached environments.
>$ pre-commit clean
>
> 7. To make changes to your local linting setup replace the symlink
> `.pre-commit-config.yaml` with a copy of `support/pre-commit-config.yaml`
> and adjust as needed. pre-commit maintains a listing of hooks of varying
> quality, https://pre-commit.com/hooks.html and other linters can be added
> pretty easily (see e.g., the `local` linters `split`, `license`, and
> `cpplint` in our setup). Consider upstreaming whatever you found useful.
>
>
>
> Happy linting,
>
> Benjamin
>
> On Sat, Aug 17, 2019 at 2:12 PM Benjamin Bannier 
> wrote:
>
> > Hi,
> >
> > I opened MESOS-9360[^1] to improve the way we do linting in Mesos some
> time
> > ago. I have put some polish on my private setup and now published it, and
> > am
> > asking for feedback as linting is an important part of working with Mesos
> > for
> > most of you. I have moved my workflow to pre-commit more than 6 months
> ago
> > and
> > prefer it so much that I will not go back to `support/mesos-style.py`.
> >
> > * * *
> >
> > We use `support/mesos-style.py` to perform linting, most often triggered
> > automatically when committing. This setup is powerful, but also hard to
> > maintain and extend. pre-commit[^2] is a framework for managing Git
> commit
> > hooks which has an exciting set of features, one can often enough
> > configure it
> > only with YAML and comes with a long list of existing linters[^3]. Should
> > we
> > go with this approach we could e.g., trivially enable linters for
> Markdown
> > or
> > HTML (after fixing the current, sometimes wild state of the sources).
> >
> > I would encourage you to play with the [chain] ending in r/71300[^4] on
> > some
> > fresh clone (as this modifies your Git hooks). You need to install
> > pre-commit[^5] _before applying the chain_, and then run
> > `support/setup_dev.sh`. This setup mirrors the existing functionality of
> > `support/mesos-style.py`, but also has new linters activated. This should
> > present a pretty streamlined workflow. I have also adjusted the Windows
> > setup,
> > but not tested it.
> >
> > I have also spent some time to make transitioning from our current
> 

Re: [MESOS-10007] random "Failed to get exit status for Command" for short-lived commands

2019-10-21 Thread Benjamin Mahler
Hi Charles, thanks for the thorough ticket and for surfacing it here for
attention, it didn't get spotted amongst the JIRA noise.

I replied on the ticket with a patch that should fix the issue, we can
discuss further in the ticket.

Ben

On Sat, Oct 19, 2019 at 7:35 AM Charles-François Natali 
wrote:

> Hi,
>
> I'm wondering if there's anything I could do to help
> https://issues.apache.org/jira/browse/MESOS-10007 move forward?
>
> Basically it's a race condition in libprocess/command executor causing
> spurious errors to be reported for short-lived tasks.
> I've got a detailed code path of the race and a repro, however I'm not
> sure what's the best way to fix it - any suggestion?
>
> Cheers,
>
> Charles
>


Re: Which IDE do you recommend to import and develop Mesos and the cpusets module?

2019-09-25 Thread Benjamin Mahler
A number of us use Eclipse, some use vim / emacs or SublimeText.

Eclipse's c++ indexer has been working well for me.

On Fri, Sep 13, 2019 at 4:18 AM Felipe Gutierrez <
felipe.o.gutier...@gmail.com> wrote:

> Hi,
>
> I saw on the Mesos documentation [1] that cquery is recommended. I am not
> familiar with cquery, but it does not seem to be an IDE to develop.
> Although it says cquery provides IDE features. According to the
> documentation, I will need to install Ninja as well. And after all that I
> will need to use an editor [3]. Is that right?
>
> Then I found this post [2] which uses Eclipse. It seems easier, although I
> am not sure if it is the best option.
>
> What IDE do you guys use to develop or extend a Mesos module?
>
> [1] http://mesos.apache.org/documentation/latest/cquery/
> [2] https://d2iq.com/blog/how-to-build-mesos-on-mac-osx-eclipse
> [3] https://github.com/cquery-project/cquery/wiki/Editor-configuration
>
> --
> -- Felipe Gutierrez
>
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>


Re: [VOTE] Release Apache Mesos 1.9.0 (rc1)

2019-08-27 Thread Benjamin Mahler
> We upgraded the version of the bundled boost very late in the release
cycle

Did we? We still bundle boost 1.65.0, just like we did during 1.8.x. We
just adjusted our special stripped bundle to include additional headers.

On Tue, Aug 27, 2019 at 1:39 PM Vinod Kone  wrote:

> -1
>
> We upgraded the version of the bundled boost very late in the release cycle
> which doesn't give downstream customers (who also depend on boost) enough
> time to vet any compatibility/perf/other issues. I propose we revert the
> boost upgrade (and the corresponding code changes depending on the upgrade)
> in 1.9.x branch but keep it in the master branch.
>
> On Tue, Aug 27, 2019 at 4:18 AM Qian Zhang  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.9.0.
> >
> >
> > 1.9.0 includes the following:
> >
> >
> 
> > * Agent draining
> > * Support configurable /dev/shm and IPC namespace.
> > * Containerizer debug endpoint.
> > * Add `no-new-privileges` isolator.
> > * Client side SSL certificate verification in Libprocess.
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc1
> >
> >
> 
> >
> > The candidate for Mesos 1.9.0 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz
> >
> > The tag to be voted on is 1.9.0-rc1:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc1
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1255
> >
> > Please vote on releasing this package as Apache Mesos 1.9.0!
> >
> > The vote is open until Friday, April 30 and passes if a majority of at
> > least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.9.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > Thanks,
> > Qian and Gilbert
> >
>


[Performance / Resource Management WG] August Update

2019-08-21 Thread Benjamin Mahler
Can't make today's meeting, so sending out some notes:

On the performance front:

- Long Fei reported a slow master, and perf data indicates a lot of time is
spent handling executor churn; this can be easily improved:
https://issues.apache.org/jira/browse/MESOS-9948

On the resource management front:

- Work on quota limits is wrapping up. The functionality is ready for
release, but some pieces are missing (e.g. there is no quota consumption
metric, it's only exposed in the /roles endpoint for now):
http://mesos.apache.org/documentation/latest/quota/

- The allocation cycle times in the LargeAndSmallQuota benchmark appear to
be ~2-3x worse than in 1.8.1 as a result of implementing quota limits. The
non-quota cycle times appear similar. We will be attempting to bring this
back down closer to 1.8.1, but it's tight given the release is planned to be
cut this week.

If folks have anything else to mention, or any comments or questions,
please chime in.

Ben


Re: Mesos 1.9.0 release

2019-08-13 Thread Benjamin Mahler
Thanks for taking this on Qian!

I seem to be unable to view the dashboard.
Also, when are we aiming to make the cut?

On Tue, Aug 13, 2019 at 10:58 PM Qian Zhang  wrote:

> Folks,
>
> It is time for Mesos 1.9.0 release and I am the release manager. Here is
> the dashboard:
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334125
>
> Please start to wrap up patches if you are contributing or shepherding any
> issues. If you expect any particular JIRA for this new release, please set
> "Target Version" as "1.9.0" and mark it as "Blocker" priority.
>
>
> Regards,
> Qian Zhang
>


[Performance / Resource Management WG] July Update

2019-07-17 Thread Benjamin Mahler
On the resource management front, Meng Zhu, Andrei Sekretenko, and myself
have been working on quota limits and enhancing multi-role framework
support:

- A memory leak in the allocator was fixed: MESOS-9852

- Support for quota limits work is well underway, and at this point the
major pieces are there and within the next few days it can be tried out.
The plan here is to try to push users towards using only quota limits, as
future support for optimistic offers will likely not support quota
guarantees.

- The /roles endpoint was fixed to expose all cases of "known" roles:
MESOS-9888, MESOS-9890.

- The /roles endpoint and roles table in the webui have been updated to
display quota consumption, as well as breakdowns of allocated, offered, and
reserved resources, see gif here:
https://reviews.apache.org/r/71059/file/1894/

- Several bugs were fixed for MULTI_ROLE framework support: MESOS-9856, MESOS-9870.


- The v0 scheduler driver and java/python bindings are being updated to
support multiple roles.


On the performance front:

- MESOS-9755: William
Mahler and I looked into updating protobuf to 3.7.x from our existing 3.5.x
in order to attempt to use protobuf arenas in the master API and we noticed
a performance regression in the v0 /state endpoint. After looking into it,
it appears to be a performance regression in the protobuf reflection code
that we use to convert from our in-memory protobuf to JSON. No issue has been
filed yet with the protobuf community, but it's worth also trying out
protobuf's built-in JSON conversion code to see how that compares (see
MESOS-9896).
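
For anyone who wants to experiment with that comparison, a conversion via
protobuf's built-in support looks roughly like the sketch below (illustrative
only, using the options from protobuf's json_util.h; whether it actually beats
our reflection-based path still needs to be benchmarked):

  #include <string>

  #include <google/protobuf/message.h>
  #include <google/protobuf/util/json_util.h>

  // Sketch: serialize any protobuf message to JSON using protobuf's
  // built-in converter, as an alternative to reflection-based conversion.
  std::string toJson(const google::protobuf::Message& message)
  {
    std::string json;

    google::protobuf::util::JsonPrintOptions options;
    options.preserve_proto_field_names = true; // Keep snake_case field names.

    // MessageToJsonString returns a status; a real caller should check it.
    google::protobuf::util::MessageToJsonString(message, &json, options);

    return json;
  }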


Feel free to reply with any questions or additional comments!

Ben


Re: Changing behaviour of suppressOffers() to preserve suppressed state on transparent re-registration by the scheduler driver

2019-06-23 Thread Benjamin Mahler
James, yes that's correct.

On Sat, Jun 22, 2019 at 12:05 AM James Peach  wrote:

> So this proposal would only affect schedulers using the libmesos scheduler
> driver API? Schedulers using the v1 HTTP would not get any changes in
> behaviour, right?
>
> > On Jun 21, 2019, at 9:56 PM, Andrei Sekretenko <
> asekrete...@mesosphere.io> wrote:
> >
> > Hi all,
> >
> > we are intending to change the behavior of the suppressOffers() method of
> > MesosSchedulerDriver with regard to the transparent re-registration.
> >
> > Currently, when driver becomes disconnected from a master, it performs on
> > its own a re-registration with an empty set of suppressed roles. This
> > causes un-suppression
> > of all the suppressed roles of the framework.
> >
> > The plan is to alter this behavior into preserving the suppression state
> on
> > this re-registration.
> >
> > The required set of suppressed roles will be stored in the driver, which
> > will be now performing re-registration with this set (instead of an empty
> > one),
> > and updating the stored set whenever a call modifying the suppression
> state
> > of the roles in the allocator is performed.
> > Currently, the driver has two methods which perform such calls:
> > suppressOffers()  and reviveOffers().
> >
> > Please feel free to raise any concerns or objections - especially if you
> > are aware of any V0 frameworks which (probably implicitly) depend on
> > un-suppression of the roles when this re-registration occurs.
> >
> >
> >
> > Note that:
> > - Frameworks which do not call suppressOffers() are, obviously,
> unaffected
> > by this change.
> >
> > - Frameworks that reliably prevent transparent-re-registration (for
> > example, by calling driver.abort() immediately from the disconnected()
> > callback), should also be not affected.
> >
> > - Storing the suppressed roles list for re-registration and clearing it
> in
> > reviveOffers() do not change anything for the existing frameworks. It is
> > setting this list in suppressOffers() which might be a cause of concerns.
> >
> > - I'm using the word "un-suppression" because re-registering with roles
> > removed from the suppressed roles list is NOT equivalent to performing
> > REVIVE call for these roles (unlike REVIVE, it does not clear
> offerFilters
> > in the allocator).
> >
> > =
> > A bit of background on why this change is needed.
> >
> > To properly support V0 frameworks with large number of roles, it is
> > necessary for the driver not to change the suppression state of the roles
> > on its own.
> > Therefore, due to the existence of the transparent re-registration in the
> > driver, we will need to store the required suppression state in the
> driver
> > and make it re-register using this state.
> >
> > We could possibly avoid the proposed change of suppressOffers() by adding
> > to the driver new interface for changing the suppression state, leaving
> > suppressOffers() as it is, and marking it as deprecated.
> >
> > However, this will leave the behaviour of suppressOffers() deeply
> > inconsistent with everything else.
> > Compare the following two sequences of events.
> > First one:
> > - The framework creates and starts a driver with roles "role1",
> "role2"...
> > "role500", the driver registers
> > - The framework calls a new method
> driver.suppressOffersForRoles({"role1",
> > ..., "role500"}), the driver performs SUPPRESS call for these roles and
> > stores them in its suppressed roles set.
> >   (Alternative with the same result: the framework calls
> > driver.updateFramework(FrameworkInfo, suppressedRoles={"role1", ...,
> > "role500"}), the driver performs UPDATE_FRAMEWORK call with those
> > parameters and stores the new suppressed roles set).
> > - The driver, due to some reason, disconnects and re-registers with the
> > same master, providing the stored suppressed roles set.
> > - All the roles are still suppressed
> > Second one:
> > - The framework creates and starts a driver with roles "role1",
> "role2"...
> > "role500", the driver registers
> > - The framework calls driver.suppressOffers(), the driver performs
> > SUPPRESS call for all roles, but doesn't modify required suppression
> state.
> > - The driver, due to some reason, disconnects and re-registers with the
> > same master, providing the stored suppressed roles set, which is empty.
> > - Now, none of the roles are suppressed, allocator generates offers for
> > 500 roles which will likely be declined by the framework.
> >
> > This is one of the examples which makes us strongly consider altering the
> > interaction between suppressOffers() and the transparent re-registration
> > when we add storing the suppression state to the driver.
> >
> > Regards,
> > Andrei Sekretenko
>
>


[Performance / Resource management WG] Notes in lieu of tomorrow's meeting

2019-05-13 Thread Benjamin Mahler
I'm out of the country and so I'm sending out notes in lieu of tomorrow's
performance / resource management meeting.

Resource Management:

- Work is underway for adding the UPDATE_FRAMEWORK scheduler::Call.
- Some fixes and small performance improvements landed for the random
sorter.
- Perf data from actual workload testing showed that the work we were doing
for incremental sorting wouldn't provide a significant impact, so that work
has been shelved for now.
- Logging was added to show the before and after state of quota headroom
during each allocation cycle.
- We will be getting back to the quota limits work shortly.

Performance:

- Some benchmarking was done for command health checks, which are rather
expensive since they create nested containers.
https://issues.apache.org/jira/browse/MESOS-9509

Please let us know if you have questions.

Ben


Performance / Resource Management Update

2019-04-17 Thread Benjamin Mahler
In lieu of today's meeting, this is an email update:

The 1.8 release process is underway, and it includes a few performance
related changes:

- Parallel reads for the v0 API have been extended to all other v0 read-only
endpoints (e.g. /state-summary, /roles, etc.). Whereas in 1.7.0, only
/state had the parallel read support. Also, requests are de-duplicated by
principal so that we don't perform redundant construction of responses if
we know they will be the same.

- The allocator performance has improved significantly when quota is in
use, benchmarking shows allocation cycle time reduced ~40% for a small size
cluster and up to ~70% for larger clusters.

- A per-framework (and per-role) override of the global
--min_allocatable_resources filter has been introduced. This lets
frameworks specify the minimum size of offers they want to see for their
roles, and improves scheduling efficiency by reducing the number of offers
declined for having insufficient resource quantities.

In the resource management area, we're currently working on the following
near term items:

- Investigating whether we can make some additional performance
improvements to the sorters (e.g. incremental sorting).
- Finishing the quota limits work, which will allow setting of limits
separate from guarantees.
- Adding an UPDATE_FRAMEWORK call to allow multi-role frameworks to change
their roles without re-subscribing.
- Exposing quota consumption via the API and UI (note that we currently
expose "allocation", but reservations are also considered against quota
consumption!)

There's lots more in the medium term, but I'll omit them here unless folks
are curious.

In the performance area, the following seem like the most pressing short
term items to me:

- Bring the v0 parallel read functionality to the v1 read-only calls.
- Bring v1 endpoint performance closer to v0.

Please chime in if there are any questions or comments,
Ben


Re: Subject: [VOTE] Release Apache Mesos 1.8.0 (rc1)

2019-04-15 Thread Benjamin Mahler
The CHANGELOG highlights seem a bit lacking?

- For some reason, the task CLI command is listed in a performance section?
- The parallel endpoint serving changes are in the longer list of items; it
seems like we should highlight them in the performance section. Maybe we
could be
specific too about what we did additionally vs 1.7.0, since we already
announced parallel state reads in 1.7.0?
- We also have some additional allocator related performance improvements
in 1.8.0 that we need to add.
- Do we want to say something w.r.t.
https://issues.apache.org/jira/browse/MESOS-9211?

Not sure to what degree we care about an accurate CHANGELOG in the 1.8.0
tag vs updating the 1.8.0 CHANGELOG on master?

On Mon, Apr 15, 2019 at 2:26 PM Benno Evers  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>
>
> 1.8.0 includes the following:
>
> 
>  * Operation feedback for v1 schedulers.
>  * Per-framework minimum allocatable resources.
>  * New CLI subcommands `task attach` and `task exec`.
>  * New `linux/seccomp` isolator.
>  * Support for Docker v2 Schema2 manifest format.
>  * XFS quota for persistent volumes.
>  * **Experimental** Support for the new CSI v1 API.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc1
>
> 
>
> The candidate for Mesos 1.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc1/mesos-1.8.0.tar.gz
>
> The tag to be voted on is 1.8.0-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc1/mesos-1.8.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc1/mesos-1.8.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1251
>
> Please vote on releasing this package as Apache Mesos 1.8.0!
>
> The vote is open until Thursday, April 18 and passes if a majority of at
> least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.8.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Benno and Joseph
>


Re: [External] Re: docker containerizer with nvidia-docker

2019-04-05 Thread Benjamin Mahler
+Kevin Klues


On Fri, Apr 5, 2019 at 1:24 AM Huadong Liu  wrote:

> Hi Ben, thanks for pointing me to the docker containerizer ticket. I do see
> the value of UCR.
>
> Since nvidia-docker already takes care of mounting the driver etc., if we
> use the "--docker=nvidia-docker" agent option to replace the docker command
> with the nvidia-docker command, GPU support with the docker containerizer
> seems trivial. Did I miss anything?
>
> On Thu, Apr 4, 2019 at 8:00 PM Benjamin Mahler  wrote:
>
> > The "UCR" (aka mesos containerizer) and "Docker containerizer" are two
> > different containerizers that users tend to choose between. UCR is what
> > many of our serious users rely on and so we made the investment there
> > first. GPU support for the docker containerizer was also something that
> was
> > planned, but hasn't been prioritized:
> > https://issues.apache.org/jira/browse/MESOS-5795
> >
> > These days, many of our users use Docker images with UCR (i.e. bypassing
> > the need for the docker daemon).
> >
> > Maybe the containerization devs can chime in here if I'm saying anything
> > inaccurate or to shed some light on where things are headed.
> >
> > On Wed, Apr 3, 2019 at 2:21 PM Huadong Liu 
> wrote:
> >
> > > Hi,
> > >
> > > Nvidia GPU support in Mesos/Marathon mandates the mesos containerizer
> > > <
> > >
> >
> https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/state/AppDefinition.scala#L557
> > > >
> > >  which "mimics" nvidia-docker.
> > > <http://mesos.apache.org/documentation/latest/gpu-support/> Can
> someone
> > > help me understand why docker containerizer with agent option
> > > "--docker=nvidia-docker" wasn't the choice? Thank you!
> > >
> > > --
> > > Huadong
> > >
> >
>


Re: docker containerizer with nvidia-docker

2019-04-04 Thread Benjamin Mahler
The "UCR" (aka mesos containerizer) and "Docker containerizer" are two
different containerizers that users tend to choose between. UCR is what
many of our serious users rely on and so we made the investment there
first. GPU support for the docker containerizer was also something that was
planned, but hasn't been prioritized:
https://issues.apache.org/jira/browse/MESOS-5795

These days, many of our users use Docker images with UCR (i.e. bypassing
the need for the docker daemon).

Maybe the containerization devs can chime in here if I'm saying anything
inaccurate or to shed some light on where things are headed.

On Wed, Apr 3, 2019 at 2:21 PM Huadong Liu  wrote:

> Hi,
>
> Nvidia GPU support in Mesos/Marathon mandates the mesos containerizer
> <
> https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/state/AppDefinition.scala#L557
> >
>  which "mimics" nvidia-docker.
>  Can someone
> help me understand why docker containerizer with agent option
> "--docker=nvidia-docker" wasn't the choice? Thank you!
>
> --
> Huadong
>


Re: Bundled glog update from 0.3.3 to 0.4.0

2019-03-27 Thread Benjamin Mahler
Thanks Andrei!

Some interesting changes for us from what I see:
  - Looks like there are some potential memory allocation reduction changes
which is nice. ("reduce dynamic allocation from 3 to 1 per log message" in
0.3.4)
  - https://github.com/google/glog/pull/245 (this will change the log file
names for those that see 'invalid-user' in the filenames, which I recall
seeing often)
  - https://github.com/google/glog/pull/145 (this fixes the issue we filed
https://github.com/google/glog/issues/84 where we've had to disable
GLOG_drop_log_memory).

After this update we should be able to remove our special case disablement
of GLOG_drop_log_memory:
https://github.com/apache/mesos/blob/1.7.2/src/logging/logging.cpp#L184-L194

Is there a ticket for the glog upgrade to 0.4.0? I filed
https://issues.apache.org/jira/browse/MESOS-9680 but couldn't find the
0.4.0 ticket to link that it's blocked by the upgrade.

On Tue, Mar 26, 2019 at 9:17 AM Andrei Sekretenko <
asekrete...@mesosphere.com> wrote:

> Hi all,
> We are intending to update the bundled glog from 0.3.3 to 0.4.0.
>
> If you have any objections/concerns, or know about any issues introduced
> into glog between 0.3.3 and 0.4.0, please raise them.
>
> Corresponding glog changelogs:
> https://github.com/google/glog/releases/tag/v0.4.0
> https://github.com/google/glog/releases/tag/v0.3.5
> https://github.com/google/glog/releases/tag/v0.3.4
>
> Regards,
> Andrei Sekretenko
>


Re: [MESOS-8248] - Expose information about GPU assigned to a task

2019-03-22 Thread Benjamin Mahler
Containers can be assigned multiple GPUs, so I assume you're thinking of
putting these metrics in a repeated message? (similar to DiskStatistics)

It has seemed to me we should probably make this Nvidia specific (e.g.
NvidiaGPUStatistics). In the past we thought generalizing this would be
good, but there's only Nvidia support at the moment and we haven't been
able to make sure that other GPU libraries provide the same information.

For each metric can you also include the relevant calls from NVML for
obtaining the information? Can you also highlight what cadvisor provides to
make sure we don't miss anything? From my read of their code, it seems to
be a subset of what you listed?
https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246
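
To make the question concrete, here is a rough sketch of the NVML calls I'd
expect these fields to map to (my assumption, not a reviewed design; in Mesos
this would presumably go through our NVML wrapper rather than raw NVML, and
initialization/error handling are omitted):

  #include <nvml.h>

  // Sketch: possible NVML sources for the proposed ResourceStatistics fields.
  void sampleGpu(nvmlDevice_t device)
  {
    nvmlMemory_t memory;
    nvmlDeviceGetMemoryInfo(device, &memory);             // memory used/total (bytes)

    nvmlUtilization_t utilization;
    nvmlDeviceGetUtilizationRates(device, &utilization);  // gpu_usage (percent)

    unsigned int temperature;
    nvmlDeviceGetTemperature(
        device, NVML_TEMPERATURE_GPU, &temperature);      // gpu_temperature (C)

    unsigned int clockMHz;
    nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clockMHz);  // gpu_frequency_MHz

    unsigned int powerMilliwatts;
    nvmlDeviceGetPowerUsage(device, &powerMilliwatts);    // gpu_power_used_W (mW)
  }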

On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado  wrote:

> another way would be to just use cadvisor
>
> > On 22 Mar 2019, at 08:35, Jorge Machado  wrote:
> >
> > Hi Mesos devs,
> >
> > In our use case from mesos we need to get gpu resource usage per task
> and build dashboards on grafana for it.  Getting the metrics to Grafana we
> will send the metrics to prometheus the main problem is how to get the
> metrics in a reliable way.
> > I proposing the following:
> >
> > Changing the mesos.proto and mesos.proto under v1 and on
> ResourceStatistics message add:
> >
> > //GPU statistics for each container
> > optional int32 gpu_idx = 50;
> > optional string gpu_uuid = 51;
> > optional string device_name = 52;
> > optional uint64 gpu_memory_used_mb = 53;
> > optional uint64 gpu_memory_total_mb = 54;
> > optional double gpu_usage = 55;
> > optional int32 gpu_temperature = 56;
> > optional int32 gpu_frequency_MHz = 57;
> > optional int32 gpu_power_used_W = 58;
> >
> > For starters I would like to change NvidiaGpuIsolatorProcess at
> isolator.cpp and there get the nvml call for the usage method. As I’m new
> to this I need some guidelines please.
> >
> > My questions:
> >
> > Does the NvidiaGpuIsolatorProcess runs already inside the container or
> just outside in the agent ? (I’m assuming outside)
> > From what I saw on the cpu metrics they are gathered inside the
> container for the gpu we could do it in the NvidiaGpuIsolatorProcess and
> get the metrics via the host.
> > Anything more that I should check ?
> >
> > Thanks a lot
> >
> > Jorge Machado
> > www.jmachado.me
> >
> >
> >
> >
> >
>
>


Re: Design Doc: Metrics subset access

2019-03-15 Thread Benjamin Mahler
Thanks Benno, this has come up before, mainly in the context of reducing
cost of computing / serving / processing large numbers of metrics.

However, in that use case, a single prefix wasn't sufficient because the
user would be interested in the subset of the metrics that they're using
for graphs and alerts, and these probably will not share a non-empty
prefix. So, the thinking was that multiple prefixes would be needed (which
doesn't work in the case of path parameters). The thinking was also to
avoid the alternative of wildcard patterns to start with (e.g.
/master/frameworks/*/tasks_running).

To give some context on why "/snapshot" is there: originally when the
metrics library was implemented, it was envisioned that there might be
multiple endpoints to read the data (e.g. "/snapshot" is current values,
"history" might expose historical timeseries, etc). In retrospect I don't
think there will be any other support other than "give me the current
values", so attempting to get rid of the "/snapshot" suffix sounds good.
But, this is orthogonal to whether a prefix path parameter or query
parameter is added, no?
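
For what it's worth, the multi-prefix variant discussed above would
conceptually just be the following filter applied to the snapshot (an
illustrative sketch only, not how the MetricsProcess is implemented):

  #include <map>
  #include <string>
  #include <vector>

  // Sketch: keep only the metrics whose key starts with at least one of the
  // requested prefixes (e.g. the subset a user graphs or alerts on).
  std::map<std::string, double> filterByPrefixes(
      const std::map<std::string, double>& snapshot,
      const std::vector<std::string>& prefixes)
  {
    std::map<std::string, double> result;

    for (const auto& metric : snapshot) {
      for (const std::string& prefix : prefixes) {
        if (metric.first.compare(0, prefix.size(), prefix) == 0) {
          result.insert(metric);
          break;
        }
      }
    }

    return result;
  }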

On Thu, Mar 14, 2019 at 10:03 PM Benno Evers  wrote:

> Hi all,
>
> while this proposal/idea is a very small change code-wise, but it would be
> employing libprocess HTTP routing logic in an afaik unprecedented way, so I
> wanted to open this up for discussion.
>
> # Motivation
>
> Currently, the only way to access libprocess metrics is via the
> `metrics/snapshot` endpoint, which returns the current values of all
> installed metrics.
>
> If the caller is only interested in a specific metric, or a subset of the
> metrics, this is wasteful in two ways: First the process has to do extra
> work to collect these metrics, and second the caller has to do extra work
> to filter out the unneeded metrics.
>
> # Proposal
> I'm proposing to have the `/metrics/` endpoint being able to be followed by
> an arbitrary path. The returned returned JSON object will contain only
> those metrics whose key begins with the specified path:
>
> `/metrics` -> Return all metrics
> `/metrics/master/messages` -> Return all metrics beginning with
> `master/messages`, e.g. `master/messages_launch_tasks`, etc.
>
> A proof of concept implementation can be found here:
> https://reviews.apache.org/r/70211
>
> # Discussion
> The current naming conventions for metrics, i.e. `master/tasks_killed`,
> suggests to the casual observer that metrics are stored and accessible in a
> hierarchical manner. Using a prefix filter allows users to filter certain
> parts of the metrics as if they were indeed hierarchical, while still
> allowing libprocess to use a flat namespace for all metric names
> internally.
>
> The method of access, using the url path directly instead of a query
> parameter, is unusual but it has the advantage that, in my obervations, it
> matches what people intuitively try to do anyways when they want to access
> a subset of metrics.
>
> One other drawback is that all other routes of the MetricsProcess will
> shadow the corresponding filter value, e.g. right now it would not be
> possible to return all metrics whose names begin with 'snapshot/'.
>
> # Alternatives
> 1) Add a `prefix` parameter to the `snapshot` endpoint, i.e.
>
> `/metrics/snapshot?prefix=/master/cpu`
>
> This is more in line with how we classically do libprocess endpoints, but
> from a UI perspective it's hard to discover: Many people, including some
> Mesos developers, already have trouble remembering to append `/snapshot` to
> get the metrics, so requiring to memorize an additional parameter does not
> seem nice.
>
> 2) Move the dynamic prefix under some other endpoint `/values`, i.e.
>
> /metrics/values/master/messages`
>
> This has the main disadvantage that /values (with empty filter) and
> /snapshot will return exactly the same data, begging the question why both
> are needed.
>
>
> What do you think? I'm looking forward to hear your thoughts, ideas, etc.
>
> Best regards,
> --
> Benno Evers
> Software Engineer, Mesosphere
>


[Performance / Resource Management WG] Meeting Cancelled Today

2019-02-20 Thread Benjamin Mahler
I was not able to wrangle together performance content for today's meeting.

On the resource management side, the design is nearly finalized for
supporting quota limits distinct from quota guarantees, in a flat role
model (no hierarchy).

As an early FYI, as a result of the complexity of guarantees, we're also
thinking about getting rid of guarantees in the longer term in favor of
purely limits + priorities + preemption (fair sharing / priority-based).

Ben


Re: Welcome Benno Evers as committer and PMC member!

2019-01-30 Thread Benjamin Mahler
Welcome Benno! Thanks for all the great contributions

On Wed, Jan 30, 2019 at 6:21 PM Alex R  wrote:

> Folks,
>
> Please welcome Benno Evers as an Apache committer and PMC member of the
> Apache Mesos!
>
> Benno has been active in the project for more than a year now and has made
> significant contributions, including:
>   * Agent reconfiguration, MESOS-1739
>   * Memory profiling, MESOS-7944
>   * "/state" performance improvements, MESOS-8345
>
> I have been working closely with Benno, paired up on, and shepherded some
> of his work. Benno has very strong technical knowledge in several areas and
> he is willing to share it with others and help his peers.
>
> Benno, thanks for all your contributions so far and looking forward to
> continuing to work with you on the project!
>
> Alex.
>


[Performance / Resource Management WG] Meeting Notes - January 16

2019-01-17 Thread Benjamin Mahler
Thanks to the folks who joined: Ilya Pronin, James Peach, Meng Zhu,
Chun-Hung Hsiao, Colin Dunn, Pawel Palucki, Maciej Iwanowski (hopefully I
didn't forget anyone)

This month Maciej and Pawel from Intel joined to present their work on
interference detection / remediation. You can see the slides here along
with other references to their work:

https://github.com/intel/owca/blob/master/docs/Orchestration-aware%20workload%20collocation%20agent%20-%20deep%20dive.pdf

The video is here:

https://zoom.us/recording/play/mq26x45sM0Ec_knkVwlJ6nZVs27KHAnnXsUiNkk3KqBsra9EHRLaZxvENDPEjKcB

Ben


Re: [VOTE] Release Apache Mesos 1.7.1 (rc1)

2019-01-02 Thread Benjamin Mahler
+1 (binding)

make check passes on macOS 10.14.2

$ clang++ --version
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ ./configure CC=clang CXX=clang++ CXXFLAGS="-Wno-deprecated-declarations"
--disable-python --disable-java --with-apr=/usr/local/opt/apr/libexec
--with-svn=/usr/local/opt/subversion && make check -j12
...
[  PASSED  ] 1956 tests.

On Fri, Dec 21, 2018 at 5:48 PM Chun-Hung Hsiao  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.1.
>
>
> 1.7.1 includes the following:
>
> 
> * This is a bug fix release. Also includes performance and API
>   improvements:
>
>   * **Allocator**: Improved allocation cycle time substantially
> (see MESOS-9239 and MESOS-9249). These reduce the allocation
> cycle time in some benchmarks by 80%.
>
>   * **Scheduler API**: Improved the experimental `CREATE_DISK` and
> `DESTROY_DISK` operations for CSI volume recovery (see MESOS-9275
> and MESOS-9321). Storage local resource providers now return disk
> resources with the `source.vendor` field set, so frameworks needs to
> upgrade the `Resource` protobuf definitions.
>
>   * **Scheduler API**: Offer operation feedbacks now present their agent
> IDs and resource provider IDs (see MESOS-9293).
>
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.1-rc1
>
> 
>
> The candidate for Mesos 1.7.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz
>
> The tag to be voted on is 1.7.1-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.1-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
>
> https://repository.apache.org/content/repositories/releases/org/apache/mesos/mesos/1.7.1-rc1/
>
> Please vote on releasing this package as Apache Mesos 1.7.1!
>
> To accommodate for the holidays, the vote is open until Mon Dec 31
> 14:00:00 PST 2018 and passes if a majority of at least 3 +1 PMC votes are
> cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Chun-Hung & Gaston
>


Reminder: Performance / Resource Management WGs meeting today at 10am PST.

2018-12-19 Thread Benjamin Mahler
This will be a combined meeting of the two working groups, and will be the
first meeting for the Resource Management working group.

See you there!


"Resource Management" Working Group

2018-12-18 Thread Benjamin Mahler
Over the past few months, we've been making several improvements related to
quota enforcement, using multiple roles / schedulers, as well as
scalability of the allocator in the master.

Going forward we'll be shifting back to pushing additional features, like
quota limits, hierarchical quota, and priority-based preemption, as well as
tackling an "optimistic" offer model to further improve scheduling
throughput in the presence of multiple schedulers.

Since there's a lot going on in this area. I'm proposing we create a
"Resource Management" working group.

Much of the recent resource management work has been performance-oriented,
since most of it impacts allocation cycle time and/or scheduling
throughput. So, I'm thinking we can start by covering resource management
in the existing performance working group meetings, and split them later.

The first meeting will occur tomorrow, combined with the performance
meeting.


Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-10 Thread Benjamin Mahler
I think we're agreed:

- There are no schedulers modeling the existing per-agent time-based
filters that mesos is tracking, and we shouldn't go in a direction that
encourages frameworks to try to model and manage these. So, we should be
very careful in considering something like CLEAR_FILTERS. We're probably
also agreed that the current filters aren't so great. :)
- Letting a scheduler have more explicit control over the offers it gets
(both in shape of the offers and overall quantity of resources) is a good
direction to go in to reduce the inefficiency in the pessimistic offer
model.
- Combining matchers of model (2) with REVIVE may eliminate the need for
CLEAR_FILTERS. I think once you have global matchers in play, it eliminates
the need for the existing decline filters to involve resource subsets and
we may be able to move new schedulers forward with a better model without
breaking old schedulers.

I don’t think model (1) was understood as intended. Schedulers would not be
expressing limits, they would be expressing a "request" equivalent to “how
much more they want”. The internal effective limit (equal to
allocation+request) is just an implementation detail here that demonstrates
how it fits cleanly into the allocation algorithm. So, if a scheduler needs
to run 10 tasks with [1 cpu, 10GB mem], they would express a request of
[10cpus ,100GB mem] regardless of how much else is already allocated at
that role/scheduler node.

>From a scheduler's perspective the difference between the two models is:

(1) expressing "how much more" you need
(2) expressing an offer "matcher"

So:

(1) covers the middle part of the demand quantity spectrum we currently
have: unsuppressed -> infinite additional demand, suppressed -> 0
additional demand, and now also unsuppressed w/ request of X -> X
additional demand

(2) is a global filtering mechanism to avoid getting offers in an unusable
shape

They both solve inefficiencies we have, and they're complementary: a
"request" could actually consist of (1) and (2), e.g. "I need an additional
10 cpus, 100GB mem, and I want offers to contain [1cpu, 10GB mem]".

I'll schedule a meeting to discuss further. We should also make sure we
come back to the original problem in this thread around REVIVE retries.

On Mon, Dec 10, 2018 at 11:58 AM Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

> Hi Ben et al.,
>
> I'd expect frameworks to *always* know how to accept or decline offers in
> general. More involved frameworks might know how to suppress offers. I
> don't expect that any framework models filters and their associated
> durations in detail (that's why I called them a Mesos implementation
> detail) since there is not much benefit to a framework's primary goal of
> running tasks as quickly as possible.
>
> > I couldn't quite tell how you were imagining this would work, but let me
> spell out the two models that I've been considering, and you can tell me if
> one of these matches what you had in mind or if you had a different model
> in mind:
>
> > (1) "Effective limit" or "give me this much more" ...
>
> This sounds more like an operator-type than a framework-type API to me.
> I'd assume that frameworks would not worry about their total limit the way
> an operator would, but instead care about getting resources to run a
> certain task at a point in time. I could also imagine this being easy to
> use incorrectly as frameworks would likely need to understand their total
> limit when issuing the call which could require state or coordination among
> internal framework components (think: multi-purpose frameworks like
> Marathon or Aurora).
>
> > (2) "Matchers" or "give me things that look like this": when a scheduler
> expresses its "request" for a role, it would act as a "matcher" (opposite
> of filter). When mesos is allocating resources, it only proceeds if
> (requests.matches(resources) && !filters.filtered(resources)). The open
> ended aspect here is what a matcher would consist of. Consider a case where
> a matcher is a resource quantity and multiple are allowed; if any matcher
> matches, the result is a match. This would be equivalent to letting
> frameworks specify their own --min_allocatable_resources for a role (which
> is something that has been considered). The "matchers" could be more
> sophisticated: full resource objects just like filters (but global), full
> resource objects but with quantities for non-scalar resources like ports,
> etc.
>
> I was thinking in this direction, but what you described is more involved
> than what I had in mind as a possible first attempt. I'd expect that
> frameworks currently use `REVIVE` as a proxy for `REQUEST_RESOURCES`, not
> as a way to manage their filter state tracked in the allocator. Assuming we
> have some way to express resource quantities (i.e., MESOS-9314), we should
> be able to improve on `REVIVE` by providing a `REQUEST_RESOURCES` which
> clears all filters for resource containing the requested resources 

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-05 Thread Benjamin Mahler
Thanks for bringing REQUEST_RESOURCES up for discussion, it's one of the
mechanisms that we've been considering for further scaling pessimistic
offers before we make the migration to optimistic offers. It's also been
referred to as "demand" rather than "request", but for the sake of this
discussion consider them the same.

I couldn't quite tell how you were imagining this would work, but let me
spell out the two models that I've been considering, and you can tell me if
one of these matches what you had in mind or if you had a different model
in mind:

(1) "Effective limit" or "give me this much more": when a scheduler
expresses its "request" for a role, it would be equivalent to setting an
"effective limit" on the framework leaf node underneath the role node (i.e.
.../role/). The effective limit would probably be set to
(request + existing .../role/ wrote:

> Hi Meng,
>
> thanks for the proposal, I agree that the way these two aspects are
> currently entangled is an issue (e.g., for master/allocator performance
> reasons). At the same time, the workflow we currently expect frameworks to
> follow is conceptually not hard to grasp,
>
> (1) If framework has work then
> (i) put framework in unsuppressed state,
> (ii) decline not matching offers with a long filter duration.
> (2) If an offer matches, accept.
> (3) If there is no more work, suppress. GOTO (1).
>
> Here the framework does not need to track its filters across allocation
> cycles (they are an unexposed implementation detail of the hierarchical
> allocator anyway) which e.g., allows metaschedulers like Marathon or Apache
> Aurora to decouple the scheduling of different workloads. A downside of
> this interface is that
>
> * there is little incentive for frameworks to use SUPPRESS in addition to
> filters, and
> * unsupression is all-or-nothing, forcing the master to send potentially
> all unused resources to one framework, even if it is only interested in a
> fraction. This can cause, at least temporal, non-optimal allocation
> behavior.
>
> It seems to me that even though adding UNSUPPRESS and CLEAR_FILTERS would
> give frameworks more control, it would only be a small improvement. In
> above framework workflow we would allow a small improvement if the
> framework knows that a new workload matches a previously running workflow
> (i.e., it can infer that no filters for the resources it is interested in
> is active) so that it can issue UNSUPPRESS instead of CLEAR_FILTERS.
> Incidentally, there seems little local benefit for frameworks to use these
> new calls as they’d mostly help the master and I’d imagine we wouldn’t want
> to imply that clearing filters would unsuppress the framework. This seems
> too little to me, and we run the danger that frameworks would just always
> pair UNSUPPRESS and CLEAR_FILTERS (or keep using REVIVE) to simplify their
> workflow. If we’d model the interface more along framework needs, there
> would be clear benefit which would help adoption.
>
> A more interesting call for me would be REQUEST_RESOURCES. It maps very
> well onto framework needs (e.g., “I want to launch a task requiring these
> resources”), and clearly communicates a requirement to the master so that
> it e.g., doesn’t need to remove all filters for a framework. It also seems
> to fit the allocator model pretty well which doesn’t explicitly expose
> filters. I believe implementing it should not be too hard if we'd restrict
> its semantics to only communicate to the master that a framework _is
> interested in a certain resource_ without promising that the framework
> _will get them in any amount of time_ (i.e., no need to rethink DRF
> fairness semantics in the hierarchical allocator). I also feel that if we
> have REQUEST_RESOURCES we would have some freedom to perform further
> improvements around filters in the master/allocator (e.g., filter
> compatification, work around increasing the default filter duration, …).
>
>
> A possible zeroth implementation for REQUEST_RESOURCES with the
> hierarchical allocator would be to have it remove any filters containing
> the requested resource and likely to unsuppress the framework. A
> REQUEST_RESOURCES call would hold an optional resource and an optional
> AgentID; the case where both are empty would map onto CLEAR_FILTERS.
>
>
> That being said, it might still be useful to in the future expose a
> low-level knob for framework allowing them to explicitly manage their
> filters.
>
>
> Cheers,
>
> Benjamin
>
>
> On Dec 4, 2018, at 5:44 AM, Meng Zhu  wrote:
> >
> > See my comments inline.
> >
> > On Mon, Dec 3, 2018 at 5:43 PM Vinod Kone  wrote:
> >
> >> Thanks Meng for the explanation.
> >>
> >> I imagine most frameworks do not remember what stuff they filtered much
> >> less figure out how previously filtered stuff  can satisfy new
> operations.
> >> That sounds complicated!
> >>
> >
> > Frameworks do not need to remember what filters they currently have. Only
> > knowing
> > the resource profiles of the current 

Re: [API WG] Proposals for dealing with master subscriber leaks.

2018-11-11 Thread Benjamin Mahler
>- We can add heartbeats to the SUBSCRIBE call.
> This would need to be
>  part of a separate operator Call, because one platform (browsers) that
> might subscribe to the master does not support two-way streaming.

This doesn't make sense to me, the heartbeats should still be part of the
same connection (request and response are infinite and heartbeating) by
default. Splitting into a separate call is messy and shouldn't be what we
force everyone to do, it should only be done in cases that it's impossible
to use a single connection (e.g. browsers).

On Sat, Nov 10, 2018 at 12:03 AM Joseph Wu  wrote:

> Hi all,
>
> During some internal scale testing, we noticed that, when Mesos streaming
> endpoints are accessed via certain proxies (or load balancers), the proxies
> might not close connections after they are complete.  For the Mesos master,
> which only has the /api/v1 SUBSCRIBE streaming endpoint, this can generate
> unnecessary authorization requests and affects performance.
>
> We are considering a few potential solutions:
>
>- We can add heartbeats to the SUBSCRIBE call.  This would need to be
>part of a separate operator Call, because one platform (browsers) that
>might subscribe to the master does not support two-way streaming.
>- We can add (optional) arguments to the SUBSCRIBE call, which tells the
>master to disconnect it after a while.  And the client would have to
> remake
>the connection every so often.
>- We can change the master to hold subscribers in a circular buffer, and
>disconnect the oldest ones if there are too many connections.
>
> We're tracking progress on this issue here:
> https://issues.apache.org/jira/browse/MESOS-9258
> Some prototypes of the code changes involved are also linked in the JIRA.
>
> Please chime in if you have any suggestions or if any of these options
> would be undesirable/bad,
> ~Joseph
>


Parallel test runner now the default for autotools / make check

2018-11-08 Thread Benjamin Mahler
During the MesosCon hackathon, Benjamin Bannier and I worked on getting
the parallel test runner usable as the default for the autotools build.

Now, when running 'make check', the tests will run much much faster!

What we did:

- added detection of 'ulimit -u' and exit if too low
- fixed some flaky tests
- changed the default 'configure' behavior to require opt-out of the
  parallel test runner
- fixed reviewbot to oot-

What's still left to do:

- fix any other flaky tests that crop up
- update CMake build to use the parallel test runner
- improve the output and subprocess management behavior of the runner

Give it a shot (you might need to reconfigure before running make check, or
you can run it manually through the CMake build) and raise any issues you see.

bmahler & bbannier


Welcome Meng Zhu as PMC member and committer!

2018-10-31 Thread Benjamin Mahler
Please join me in welcoming Meng Zhu as a PMC member and committer!

Meng has been active in the project for almost a year and has been very
productive and collaborative. He is now one of the few people who understand
the allocator code well, as well as the roadmap for this area of the project.
He has also found and fixed bugs, and helped users in Slack.

Thanks for all your work so far, Meng. I'm looking forward to more of your
contributions in the project.

Ben


Re: Dedup mesos agent status updates at framework

2018-10-29 Thread Benjamin Mahler
The timeout behavior sounds like a dangerous scalability tripwire. Consider
revisiting that approach.

On Sun, Oct 28, 2018 at 10:42 PM Varun Gupta 
wrote:

> Mesos Version: 1.6
>
> scheduler has 250k events in its queue: the Mesos Master sends status
> updates to the scheduler, and the scheduler stores them in the queue. The
> scheduler processes them in FIFO order, and once processed (which includes
> persisting to DB) it acks the update.
> These updates are processed asynchronously with a thread pool of 1000 size.
> We are using explicit reconciliation.
> If the ack to the Mesos Master is timing out due to high CPU usage, then
> the next ack will likely fail too. It slows down processing on the
> Scheduler side, meanwhile the Mesos Master continues to send status
> updates (duplicate ones, since old status updates are not acked). This
> leads to a buildup of status updates at the Scheduler to be processed, and
> we have seen it grow up to 250k status updates.
>
> Timeout is the explicit ack request from Scheduler to Mesos Master.
>
> Mesos Master profiling: Next time, when this issue occurs, I will take the
> dump.
>
> Deduplication is for the status updates present in the queue for scheduler
> to process, idea is to dedup duplicate status updates such that scheduler
> only processes same status update pending in queue once, and ack to Mesos
> Master also ones. It will reduce the load for both Scheduler and Mesos
> Master. After the ack (success/fail) scheduler will remove the status
> update from the queue, and in case of failure, Mesos Master will send
> status update again.
>
>
>
> On Sun, Oct 28, 2018 at 10:15 PM Benjamin Mahler 
> wrote:
>
> > Which version of mesos are you running?
> >
> > > In framework, event updates grow up to 250k
> >
> > What does this mean? The scheduler has 250k events in its queue?
> >
> > > which leads to cascading effect on higher latency at Mesos Master (ack
> > requests with 10s timeout)
> >
> > Can you send us perf stacks of the master during such a time window so
> > that we can see if there are any bottlenecks?
> > http://mesos.apache.org/documentation/latest/performance-profiling/
> >
> > Where is this timeout coming from and how is it used?
> >
> > > simultaneously explore if dedup is an option
> >
> > I don't know what you're referring to in terms of de-duplication. Can you
> > explain how the scheduler's status update processing works? Does it use
> > explicit acknowledgements and process batches asynchronously? Aurora
> > example: https://reviews.apache.org/r/33689/
> >
> > On Sun, Oct 28, 2018 at 8:58 PM Varun Gupta 
> > wrote:
> >
> >> Hi Benjamin,
> >>
> >> In our batch workload use case, number of tasks churn is pretty high. We
> >> have seen 20-30k tasks launch within 10 second window and 100k+ tasks
> >> running.
> >>
> >> In framework, event updates grow up to 250k, which leads to cascading
> >> effect on higher latency at Mesos Master (ack requests with 10s timeout)
> >> as
> >> well as blocking framework to process new since there are too many left
> to
> >> be acknowledged.
> >>
> >> Reconciliation is every 30 mins which also adds pressure on event stream
> >> if
> >> too many unacknowledged.
> >>
> >> I am thinking to experiment with default backoff period from 10s -> 30s
> or
> >> 60s, and simultaneously explore if dedup is an option.
> >>
> >> Thanks,
> >> Varun
> >>
> >> On Sun, Oct 28, 2018 at 6:49 PM Benjamin Mahler 
> >> wrote:
> >>
> >> > Hi Varun,
> >> >
> >> > What problem are you trying to solve precisely? There seems to be an
> >> > implication that the duplicate acknowledgements are expensive. They
> >> should
> >> > be low cost, so that's rather surprising. Do you have any data related
> >> to
> >> > this?
> >> >
> >> > You can also tune the backoff rate on the agents, if the defaults are
> >> too
> >> > noisy in your setup.
> >> >
> >> > Ben
> >> >
> >> > On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta  wrote:
> >> >
> >> > >
> >> > > Hi,
> >> > >>
> >> > >> Mesos agent will send status updates with exponential backoff until
> >> ack
> >> > >> is received.
> >> > >>
> >> > >> If processing of events at framework and sending ack to Master is
> >> > running
> >> > >> slow then it builds a back pressure at framework due to duplicate
> >> > updates
> >> > >> for same status.
> >> > >>
> >> > >> Has someone explored the option to dedup same status update event
> at
> >> > >> framework or is it even advisable to do. End goal is to dedup all
> >> events
> >> > >> and send only one ack back to Master.
> >> > >>
> >> > >> Thanks,
> >> > >> Varun
> >> > >>
> >> > >>
> >> > >>
> >> >
> >>
> >
>


Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Benjamin Mahler
Which version of mesos are you running?

> In framework, event updates grow up to 250k

What does this mean? The scheduler has 250k events in its queue?

> which leads to cascading effect on higher latency at Mesos Master (ack
requests with 10s timeout)

Can you send us perf stacks of the master during such a time window so that
we can see if there are any bottlenecks?
http://mesos.apache.org/documentation/latest/performance-profiling/

Where is this timeout coming from and how is it used?

> simultaneously explore if dedup is an option

I don't know what you're referring to in terms of de-duplication. Can you
explain how the scheduler's status update processing works? Does it use
explicit acknowledgements and process batches asynchronously? Aurora
example: https://reviews.apache.org/r/33689/
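
In other words, something along these lines (a standalone sketch of explicit
acknowledgement with asynchronous processing; illustrative only, not Mesos or
Aurora code). Note that if pending updates are keyed by task ID and status
UUID, duplicates caused by the agent's retry backoff collapse naturally and
each distinct update is processed and acked exactly once:

#include <deque>
#include <set>
#include <string>
#include <utility>

// Illustrative sketch only -- not Mesos or Aurora code.
struct StatusUpdate {
  std::string taskId;
  std::string uuid;   // uniquely identifies this status update
  std::string state;  // e.g. TASK_RUNNING, TASK_FINISHED
};

class UpdateQueue {
public:
  // Returns false (and drops the update) if this exact update is already
  // pending; the original entry will produce the acknowledgement.
  bool enqueue(const StatusUpdate& update) {
    if (!pending_.insert({update.taskId, update.uuid}).second) {
      return false;
    }
    queue_.push_back(update);
    return true;
  }

  // Next update to process (FIFO). The consumer persists it asynchronously,
  // sends a single explicit ack, and then calls acknowledged().
  bool dequeue(StatusUpdate* out) {
    if (queue_.empty()) {
      return false;
    }
    *out = queue_.front();
    queue_.pop_front();
    return true;
  }

  // Called once the update is persisted and the explicit ack has been sent.
  void acknowledged(const StatusUpdate& update) {
    pending_.erase({update.taskId, update.uuid});
  }

private:
  std::deque<StatusUpdate> queue_;
  std::set<std::pair<std::string, std::string>> pending_;
};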

On Sun, Oct 28, 2018 at 8:58 PM Varun Gupta  wrote:

> Hi Benjamin,
>
> In our batch workload use case, number of tasks churn is pretty high. We
> have seen 20-30k tasks launch within 10 second window and 100k+ tasks
> running.
>
> In framework, event updates grow up to 250k, which leads to cascading
> effect on higher latency at Mesos Master (ack requests with 10s timeout) as
> well as blocking framework to process new since there are too many left to
> be acknowledged.
>
> Reconciliation is every 30 mins which also adds pressure on event stream if
> too many unacknowledged.
>
> I am thinking to experiment with default backoff period from 10s -> 30s or
> 60s, and simultaneously explore if dedup is an option.
>
> Thanks,
> Varun
>
> On Sun, Oct 28, 2018 at 6:49 PM Benjamin Mahler 
> wrote:
>
> > Hi Varun,
> >
> > What problem are you trying to solve precisely? There seems to be an
> > implication that the duplicate acknowledgements are expensive. They
> should
> > be low cost, so that's rather surprising. Do you have any data related to
> > this?
> >
> > You can also tune the backoff rate on the agents, if the defaults are too
> > noisy in your setup.
> >
> > Ben
> >
> > On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta  wrote:
> >
> > >
> > > Hi,
> > >>
> > >> Mesos agent will send status updates with exponential backoff until
> ack
> > >> is received.
> > >>
> > >> If processing of events at framework and sending ack to Master is
> > running
> > >> slow then it builds a back pressure at framework due to duplicate
> > updates
> > >> for same status.
> > >>
> > >> Has someone explored the option to dedup same status update event at
> > >> framework or is it even advisable to do. End goal is to dedup all
> events
> > >> and send only one ack back to Master.
> > >>
> > >> Thanks,
> > >> Varun
> > >>
> > >>
> > >>
> >
>


Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Benjamin Mahler
Hi Varun,

What problem are you trying to solve precisely? There seems to be an
implication that the duplicate acknowledgements are expensive. They should
be low cost, so that's rather surprising. Do you have any data related to
this?

You can also tune the backoff rate on the agents, if the defaults are too
noisy in your setup.

Ben

On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta  wrote:

>
> Hi,
>>
>> Mesos agent will send status updates with exponential backoff until ack
>> is received.
>>
>> If processing of events at framework and sending ack to Master is running
>> slow then it builds a back pressure at framework due to duplicate updates
>> for same status.
>>
>> Has someone explored the option to dedup same status update event at
>> framework or is it even advisable to do. End goal is to dedup all events
>> and send only one ack back to Master.
>>
>> Thanks,
>> Varun
>>
>>
>>


Re: LibProcess on windows

2018-10-19 Thread Benjamin Mahler
+andy

Some folks have been working on porting it to windows, they could provide
you with the latest status.

On Thu, Oct 18, 2018 at 1:48 PM Vaibhav Khanduja 
wrote:

> Has anybody used libprocess outside of mesos on windows? Thx
>


Re: Proposal: Adding health check definitions to master state output

2018-10-18 Thread Benjamin Mahler
> It's worth mentioning that I believe the original intention of the 'Task'
> message was to contain most information contained in 'TaskInfo', except
for
> those fields which could grow very large, like the 'data' field.

+1 all task / executor metadata should be exposed IMO. I look at the 'data'
field as a payload delivered to the executor / task rather than it being
part of the metadata. Based on this, if one wanted to have custom metadata
that gets exposed, labels would be used instead.
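
For example, attaching such metadata via labels looks roughly like this on
the scheduler side (a sketch; the key and value here are made up):

#include <mesos/mesos.pb.h>

// Sketch: exposing custom task metadata via labels rather than 'data'.
// Labels set on the TaskInfo are carried into the master's state output
// for the task.
void addMetadata(mesos::TaskInfo* task)
{
  mesos::Label* label = task->mutable_labels()->add_labels();
  label->set_key("com.example.owner");        // made-up key
  label->set_value("observability-team");     // made-up value
}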

On Thu, Oct 18, 2018 at 3:21 PM Greg Mann  wrote:

> Hi all,
> In addition to the health check API change proposal that I recently sent
> out, we're considering adding a task's health check definition (when
> present) to the 'Task' protobuf message so that it appears in the master's
> '/state' endpoint response, as well as the v1 GET_STATE response and the
> TASK_ADDED event. This will allow operators to detect the presence and
> configuration of health checks on tasks via the operator API, which they
> are currently unable to do:
>
> message Task {
>   . . .
>
>   optional HealthCheck health_check = 15;
>
>   . . .
> }
>
> I wanted to check in with the community regarding this change, since for
> very large clusters it could have a non-negligible impact on the size of
> the master's state output.
>
> It's worth mentioning that I believe the original intention of the 'Task'
> message was to contain most information contained in 'TaskInfo', except for
> those fields which could grow very large, like the 'data' field.
>
> Please reply if you foresee this change having a negative impact on your
> deployments, or if you have any other thoughts/concerns!
>
> Thanks,
> Greg
>


[Performance WG] Meeting today

2018-10-17 Thread Benjamin Mahler
Hi folks,

I didn't get a chance to send out an agenda for this meeting, and it looks
like only Chun-Hung joined, so let's just do this over email instead.

Since the last meeting, we landed the copy-on-write Resources optimization
and we landed the fixes to sorter performance. The blog post highlighting
the 1.7.x performance improvements was published as well, take a look!

https://twitter.com/ApacheMesos/status/1049740950359044096

Some recent performance findings:

Docker containerizer actor can get backlogged with large number of
containers:
https://issues.apache.org/jira/browse/MESOS-9283

/stateSummary spends time (unnecessarily) coalescing ranges:
https://issues.apache.org/jira/browse/MESOS-9297

Priority wise, assessing v1 scheduler and operator API performance would be
the best use of time IMHO. If anyone wants to help out here it would be
most welcome.

Ben


Re: monitoring mesos master load

2018-10-12 Thread Benjamin Mahler
The following are probably what you're looking for:

https://issues.apache.org/jira/browse/MESOS-9237
https://issues.apache.org/jira/browse/MESOS-9236

On Fri, Oct 12, 2018 at 12:02 PM Eric Chung  wrote:

> Hello devs,
>
> We recently had an incident where the master was overloaded by the
> scheduler's ACKNOWLEDGE requests, causing the http api latencies to spike.
> I have two questions:
> - what is the best way to instrument the http api to emit latency metrics?
> - what's the best way to monitor the master's load, in addition to the api
> latencies?
>
> apparently monitoring cpu doesn't help much as the master will never
> saturate a machine with more than 2 cpus. any guidance on this would be
> much appreciated.
>
> Thanks!
> Eric
>


Re: Mesos Flakiness Statistics

2018-10-12 Thread Benjamin Mahler
Thanks for sending this Benno! I for one would love to see more regular
communication about the state of CI, especially so that I know how I can
help fix tests (right now I don't know which flaky tests are in areas I am
maintaining).

Is there any reason the first portion of the test name is being truncated?
For example, ResourceStatistics matches several tests:

$ grep -R ' ResourceStatistics)' src/tests
src/tests/containerizer/xfs_quota_tests.cpp:TEST_F(ROOT_XFS_QuotaTest,
ResourceStatistics)
src/tests/slave_recovery_tests.cpp:TEST_F(MesosContainerizerSlaveRecoveryTest,
ResourceStatistics)
src/tests/disk_quota_tests.cpp:TEST_F(DiskQuotaTest, ResourceStatistics)

Did we actually fix the flaky tests or did we disable them? I see only 22
disabled tests, which is better than I expected, but I hope there's good
tracking on getting these un-disabled again:

$ grep -R DISABLED src/tests | grep -v DISABLED_ON_WINDOWS | grep -v
NestedQuota | grep -v ChildRole | grep -v NestedRoles | grep -v
environment.cpp | wc -l
  22

On Fri, Oct 12, 2018 at 7:38 AM Benno Evers  wrote:

> Hey all,
>
> as you might know, we've set up an internal CI system that is running `make
> check` on a variety of different platforms and configurations, 16 in total.
>
> As we've experienced more and more pain maintaining a green master, I've
> compiled some statistics about which tests are most flaky. I thought other
> people might also be interested to have a look at that data:
>
> Last Week:
>
> # CI Statistics since 2018-10-05 14:22:35.422882 for branches
> containing 'asf/master'
> Total: 41 failing tests, 28 unique. (avg 0.14236111 failing tests
> per build)
>
> Top 5 failing tests:
> 6x: [empty]
> 4x: ResourceStatistics
> 2x: CreateDestroyDiskRecovery
> 2x: INTERNET_CURL_InvokeFetchByName
> 2x: RecoverNestedContainer
>
> Last Month:
>
> # CI Statistics since 2018-09-12 14:23:36.272031 for branches
> containing 'asf/master'
> Total: 320 failing tests, 75 unique. (avg 0.285714285714 failing tests
> per build)
>
> Top 5 failing tests:
> 57x: Used
> 32x: LongLivedDefaultExecutorRestart
> 27x: PythonFramework
> 23x: ROOT_CGROUPS_LaunchNestedContainerSessionsInParallel
> 22x: ResourceStatistics
>
> Last year:
>
> # CI Statistics since 2017-10-12 14:24:31.639792 for branches
> containing 'asf/master'
> Total: 3045 failing tests, 225 unique. (avg 0.184054642166 failing
> tests per build)
>
> Top 5 failing tests:
> 292x: [empty]
> 272x: ROOT_LOGROTATE_UNPRIVILEGED_USER_RotateWithSwitchUserTrueOrFalse
> 136x: LOGROTATE_RotateInSandbox
> 136x: LOGROTATE_CustomRotateOptions
> 131x: ResourceStatistics
>
>
> I don't really have a point with all of this, but some observations:
>  - [empty] means that the `mesos-tests` binary crashed
>  - The data also includes "real", i.e. non-flaky test failures, but they
> should not appear in the top 5 lists because we would hopefully either
> revert or fix them before they can accumulate dozens of failures
>  - Over the whole year, we seem to be pretty good at fixing  the nastiest
> flakes, with only one of the top 5 still appearing in this weeks test
> results
>  - Sadly, the fail percentage isn't as different between now and then as we
> might have hoped.
>
> Hope this was interesting, and best regards,
> --
> Benno Evers
> Software Engineer, Mesosphere
>


Re: Adding support for implicit allocation of mandatory custom resources in Mesos

2018-10-11 Thread Benjamin Mahler
Thanks for the thorough explanation.

Yes, it sounds acceptable and useful for assigning disk i/o and network
i/o. The error case of there not being enough resources post-injection
seems unfortunate but I don't see a way around it.

Can you file a ticket with this background?
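
For concreteness, the injection heuristic described in the quoted proposal
below boils down to something like the following (a standalone sketch with
illustrative names and units; not Mesos code):

// A task that declares no network bandwidth is implicitly assigned a share
// proportional to its CPU share of the agent.
double implicitNetworkBandwidth(
    double taskCpus,
    double agentTotalCpus,
    double agentTotalBandwidthMbps)
{
  return taskCpus / agentTotalCpus * agentTotalBandwidthMbps;
}

// For example, a 4-CPU task on a 40-CPU agent with a 10000 Mbps link would
// be implicitly assigned 4 / 40 * 10000 = 1000 Mbps.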

On Thu, Oct 11, 2018 at 1:30 AM Clément Michaud 
wrote:

> Hello,
>
> TL;DR; we have added network bandwidth as a first-class resource in our
> clusters with a custom isolator and we have patched Mesos master to
> introduce the concept of implicit allocation of custom resources to make
> network bandwidth mandatory for all tasks. I'd like to know what you think
> about what we have implemented and if you think we could introduce a new
> hook with the aim of injecting mandatory custom resources to tasks in Mesos
> master.
>
>
> At Criteo we have implemented a custom solution in our Mesos clusters to
> prevent network noisy neighbors and to allow our users to define a custom
> amount of reserved network bandwidth per application. Please note we run
> our clusters on a flat network and we are not using any kind of network
> overlay.
>
> In order to address these use cases, we enabled the `net_cls` isolator and
> wrote an isolator using tc, conntrack and iptables, each container having a
> dedicated custom reserved amount of network bandwidth declared by
> configuration in Marathon or Aurora.
>
> In the first implementation of our solution, the resources were not
> declared in the agents and obviously not taken into account by Mesos but
> the isolator allocated an amount of network bandwidth for each task
> relative to the number of reserved CPUs and the number of available CPUs on
> the server. Basically, the per container network bandwidth limitation was
> applied but Mesos was not aware of it. Using the CPU as a proxy for the
> amount of network bandwidth protected us from situations where an agent
> could allocate more network bandwidth than available on the agent. However,
> this model reached its limits when we introduced big consumers of network
> bandwidth in our clusters. They had to raise the number of CPUs to get more
> network bandwidth and therefore it introduced scheduling issues.
>
> Hence, we decided to leverage Mesos custom resources to let our users
> declare their requirements but also to decouple network bandwidth from CPU
> to avoid scheduling issues. We first declared the network bandwidth
> resource on every Mesos agents even if tasks were not declaring any. Then,
> we faced a first issue: the lack of support of network bandwidth and/or
> custom resources in Marathon and Aurora (well it seems in most frameworks
> actually). This led to a second issue: we needed Mesos to account for the
> network bandwidth of all tasks even if some frameworks were not supporting
> it yet. Solving the second problem allowed us to run a smooth migration by
> patching frameworks independently in a second phase.
>
> On the way we found out that the “custom resources” system wasn’t meeting
> our needs, because it only allows for “optional resources”, and not
> “mandatory resources” (resources that should be accounted for all tasks in
> a cluster, even if not required explicitly, like CPU, RAM or disk space.
> Good candidates are network bandwidth or disk I/O for instance).
>
> To enforce the usage of network bandwidth across all tasks we wanted to
> allocate an implicit amount of network bandwidth to tasks not declaring any
> in their configuration. One possible implementation was to make the Mesos
> master automatically compute the allocated network bandwidth for the task
> when the offer is accepted and subtract this amount from the overall
> available resources in Mesos. We consider this implicit allocation as a
> fallback mechanism for frameworks not supporting "mandatory" resources.
> Indeed, in a proper environment all frameworks would support these
> mandatory resources. Unfortunately, adding support for a new resource (or
> for custom resources) in all frameworks might not be manageable in a timely
> manner, especially in an ecosystem with multiple frameworks.
>
> Consequently, we wrote a patch in Mesos master to allocate an implicit
> amount of network bandwidth when it is not provided in the TaskInfo. In our
> case this implicit amount is computed based on the following Criteo
> specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`.
>
> Here is what happened when our frameworks were not supporting network
> bandwidth yet: offers were sent to frameworks and they accepted or rejected
> them regardless of network bandwidth available on the slave. When an offer
> was accepted, the TaskInfo sent by the framework obviously did not contain
> any network bandwidth but Mesos master implicitly injected some and let the
> task follow its way. There were two cases then: either the slave had enough
> resources to run the task and it was scheduled as expected or it did not
> have enough resources and the task failed to 

Re: Vote now for MesosCon 2018 proposals!

2018-09-25 Thread Benjamin Mahler
Voted! Thanks Jörg and the PC!

On Thu, Sep 20, 2018 at 9:51 AM Jörg Schad  wrote:

> Dear Mesos Community,
>
> Please take a few minutes over the next few days and review what members
> of the community have submitted for MesosCon 2018
>  (which will be held in San Francisco between
> November 5th-7th)!
> To make voting easier, we structured the voting following the different
> tracks.
> Please visit the following links and submit your responses. Look through
> as few or as many talks as you'd like to, and give us your feedback on
> these talks.
>
> Core Track: https://www.surveymonkey.com/r/mesoscon18-core
> Ecosystem Track: https://www.surveymonkey.com/r/mesoscon18-ecosystem
> DC/OS Track: https://www.surveymonkey.com/r/mesoscon18-dcos
> Frameworks Track: https://www.surveymonkey.com/r/mesoscon18-frameworks
> Operations Tracks: https://www.surveymonkey.com/r/mesoscon18-operations
> Misc Track: https://www.surveymonkey.com/r/mesoscon18-misc
> User Track: https://www.surveymonkey.com/r/mesoscon18-users
>
> Please submit your votes until Wednesday, Sept 26th 11:59 PM PDT, so you
> have one week to vote and make your voice heard!
>
> Thank you for your help and looking forward to a great MesosCon!
> Your MesosCon PC
>


Re: Differing DRF flavors over roles and frameworks

2018-09-24 Thread Benjamin Mahler
Filed https://issues.apache.org/jira/browse/MESOS-9255 to make this
consistent.
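
For readers following the arithmetic in the quoted thread below, here is a
small standalone sketch of the dominant-share computation under discussion
(illustrative only, not the Mesos sorter). The key point: with multiple
resource kinds, the shape of the denominator pool can change which resource
is dominant, and therefore the resulting order.

#include <algorithm>
#include <map>
#include <string>

// A client's DRF share is the maximum, over resource kinds, of
// allocated / pool. With a single resource kind the pool only rescales
// every client equally, but with multiple kinds the *shape* of the pool
// can change which resource dominates for a client.
double dominantShare(
    const std::map<std::string, double>& allocated,
    const std::map<std::string, double>& pool)
{
  double share = 0.0;
  for (const auto& entry : allocated) {
    const auto it = pool.find(entry.first);
    if (it != pool.end() && it->second > 0.0) {
      share = std::max(share, entry.second / it->second);
    }
  }
  return share;
}

// Sharing the whole cluster vs. sharing only the role's current allocation
// amounts to passing a different 'pool' here, which is the discrepancy
// described in the quoted thread.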

On Thu, Nov 30, 2017 at 12:27 PM, Benjamin Mahler  wrote:

>
>
> On Thu, Nov 30, 2017 at 2:52 PM, Benjamin Bannier <
> benjamin.bann...@mesosphere.io> wrote:
>
>> Hi Ben,
>>
>> and thank you for answering.
>>
>> > > For frameworks in the same role on the other hand we choose to
>> normalize
>> > > with the allocated resources
>> >
>> > Within a role, the framework's share is evaluated using the *role*'s
>> total
>> > allocation as a denominator. Were you referring to the role's total
>> > allocation when you said "allocated resources"?
>>
>> Yes.
>>
>> > I believe this was just to reflect the "total pool" we're sharing
>> within.
>> > For roles, we're sharing the total cluster as a pool. For frameworks
>> within
>> > a role, we're sharing the role's total allocation as a pool amongst the
>> > frameworks. Make sense?
>>
>> Looking at the allocation loop, I see that while a role sorter uses the
>> actual cluster resources when generating a sorting, we only seem to
>> update the total in the picked framework sorter with an `add` at the end
>> of the allocation loop, so at the very least the "total pool" of
>> resources in a single role seems to lag. Should this update be moved to
>> the top of the loop?
>>
>
> Yes, the role's allocation pool should be adjusted before the framework
> sorting.
>
>
>>
>> > The sort ordering should be the same no matter which denominator you
>> > choose, since everyone gets the same denominator. i.e. 1,2,3 are ordered
>> > the same whether you're evaluating their share as 1/10,2/10,3/10 or
>> > 1/100,2/100,3/100, etc.
>>
>> This seems to be only true if we have just a single resource kind. For
>> multiple resource kinds we are not just dealing with a single scale
>> factor, but will also end up comparing single-resource scales against
>> each other in DRF.
>>
>> Here's a brief example of a cluster with two frameworks where we end up
>> with different DRF weights `f` depending on whether the frameworks are in
>> the same role or not.
>>
>> - Setup:
>>   * cluster total: cpus:40; mem:100; disk:1000
>>   * cluster used:  cpus:30; mem:  2; disk:   5
>>
>>   * framework 'a': used=cpus:20; mem:1; disk:1
>>   * framework 'b': used=cpus:10; mem:1; disk:4
>>
>> - both frameworks in separate roles
>>   * framework 'a', role 'A'; role shares: cpus:2/4; mem:1/100;
>> disk:1/1000; f=2/4
>>   * framework 'b', role 'B'; role shares: cpus:1/4; mem:1/100;
>> disk:2/1000; f=1/4
>>
>> - both frameworks in same role:
>>   * framework 'a': framework shares: cpus:2/3; mem:1/2; disk:1/4; f=1/2
>>   * framework 'b': framework shares: cpus:1/3; mem:1/2; disk:4/5; f=4/5
>>
>> If each framework is in its own role we would allocate the next resource
>> to 'b'; if the frameworks are in the same role we would allocate to 'a'
>> instead. This is what I meant with
>>
>> > It appears to me that by normalizing with the used resources inside a
>> role
>> > we somehow bias allocations inside a role against frameworks with
>> “unusual”
>> > usage vectors (relative to other frameworks in the same role).
>>
>> In this example we would penalize 'b' for having a usage vector very
>> different from 'a' (here: along the `disk` axis).
>>
>
> Ah yes, thanks for clarifying. The role's allocation pool that the
> frameworks in that role are sharing may have a different ratio of resources
> compared to the total cluster.
>
>
>>
>>
>> Benjamin
>>
>
>


[Performance WG] Meeting Notes - September 19

2018-09-20 Thread Benjamin Mahler
Thanks to those who joined: Yan Xu, Chun-Hung Hsiao, Meng Zhu, Carl Dellar

Notes:

(1) I forgot to mention during the meeting that more progress has happened
on the parallel reads of master state for the other read-only endpoints.
Alex or Benno can reply to this thread to provide an update. [1]

(2) Work is ongoing to improve allocation cycle performance [2]:

  (a) The patches for making the Resources wrapper copy on write are ready
to land [3]. These improve the performance of common filtering operations.
Meng presented some allocation cycle data based on this:

https://docs.google.com/spreadsheets/d/1GmBdialteknPDf8IdumzPbF4bGmu75mVHIpiXLf3xHc

The data shows that copy on write Resources improves allocation cycle time
significantly, but as there are more frameworks, the Sorter starts to
dominate the time spent in the allocation cycle and the relative benefit
decreases.
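
As background on the copy-on-write approach (an illustrative sketch, not the
actual Resources implementation): copies share the underlying storage, and a
deep copy only happens when a shared object is about to be mutated.

#include <memory>
#include <vector>

template <typename T>
class CopyOnWrite
{
public:
  CopyOnWrite() : data_(std::make_shared<std::vector<T>>()) {}

  // Copying the wrapper is cheap: only the shared_ptr is copied.
  const std::vector<T>& read() const { return *data_; }

  // Mutation: clone the storage first if anyone else still shares it.
  std::vector<T>& mutate()
  {
    if (data_.use_count() > 1) {
      data_ = std::make_shared<std::vector<T>>(*data_);
    }
    return *data_;
  }

private:
  std::shared_ptr<std::vector<T>> data_;
};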

  (b) To improve allocation cycle time further by addressing the sorter
performance issues, I sent out / will send out a few patches [4]. The two
that provide the most benefit are: introducing an efficient
ScalarResourceQuantities type to make sort itself faster, and avoiding
dirtying the sorter upon allocation so that the allocation cycle doesn't
have to keep re-sorting. The latter requires an additional change to update
the usage of framework sorters so that the total they use are the entire
cluster rather than the role allocation.

(3) There's also been significant improvements to the master's offer
fan-out path [5]. We don't yet have a benchmark for this, but I'll try to
demonstrate the improvement in 1.8.

(4) Meng showed a new allocator benchmark test fixture that Kapil worked on
that makes it easier to get a "cluster" set up with a particular
configuration to make it easier to measure allocator scenarios of interest
[6].

(5) We chatted briefly about the master's call ingestion performance,
there's a benchmark [7] that uses the reconciliation call to send a big
message and Ilya looked into the results some time ago, but we should
revisit and gather performance data.

(6) I'm nearly done with the 1.7.0 performance blog post, just waiting on
some data from Alex / Benno.

Agenda Doc: https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU

Ben

[1] https://issues.apache.org/jira/browse/MESOS-9158
[2] https://issues.apache.org/jira/browse/MESOS-9087
[3] https://issues.apache.org/jira/browse/MESOS-6765
[4] https://issues.apache.org/jira/browse/MESOS-9239
[5] https://issues.apache.org/jira/browse/MESOS-9234
[6] https://issues.apache.org/jira/browse/MESOS-9187
[7]
https://github.com/apache/mesos/blob/1.7.0/src/tests/scheduler_tests.cpp#L2164-L2230


[Performance WG] Meeting Reminder: September 19

2018-09-19 Thread Benjamin Mahler
Just a reminder that the meeting today will go ahead as planned.

There are no guest presentations this time around, but as usual we'll go
over the exciting work that's been happening lately. Feel free to add items
to the agenda.

I will send out notes afterwards as usual.

Ben


[Performance WG] Meeting Notes - August 15

2018-08-15 Thread Benjamin Mahler
For folks that missed it, here are my notes. Thanks to jie for presenting!

(1) Jie presented a containerization benchmark:

https://reviews.apache.org/r/68266/

The motivation to add this was the mount table read issue that came up
originally in MESOS-8418 [1]. We only pushed a short term fix for container
metrics requests, and the performance of recovering / launching containers
is still affected by it until MESOS-9081 [2] is fixed.

The benchmark launches N containers, then destroys them and waits for them
to terminate. Running it against a 24 core (48 hyperthreaded core) machine
for 1000 containers showed roughly a minute for getting 1000 containers
launched and a similar amount of time to get them destroyed. We haven't
spent time optimizing this so there should be room for improvement.

It will be interesting to extend this to compare apples to apples with the
numbers in:

https://kubernetes.io/blog/2018/05/24/kubernetes-containerd-integration-goes-ga/#pod-startup-latency

Please share feedback on the review if you have any thoughts!


(2) I gave an update on other performance work:

  (a) State serving improvements: The parallel serving for the /state
endpoint is committed and will land in 1.7.0. There's still more to do here
to improve performance by avoiding an extra trip through the master's queue
during authorization [3], as well as migrate all reads over to parallel
serving.

  (b) There were some authentication scalability issues discovered during a
scale test. Fixes are underway. [4] [5] [6] [7]

  (c) There remains still some ongoing performance improvements to the
allocator. Recently, range subtraction for ports resources was improved
[8]. The main focus right now is additional copy elimination of Resources,
and the approach that we're attempting is to make Resources copy-on-write
[9]. Regardless, 1.7.0 does already include some significant improvements
to allocation cycle performance and I'll be sure to cover it in the blog
post.


(3) Lastly, we had a very brief discussion about jemalloc. James' agreed to
share on the existing thread some more details about potential
compatibility issues if we were to link our libraries against it.


Agenda Doc:
https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU


[1] https://issues.apache.org/jira/browse/MESOS-8418
[2] https://issues.apache.org/jira/browse/MESOS-9081
[3] https://issues.apache.org/jira/browse/MESOS-9082
[4] https://issues.apache.org/jira/browse/MESOS-9144
[5] https://issues.apache.org/jira/browse/MESOS-9145
[6] https://issues.apache.org/jira/browse/MESOS-9146
[7] https://issues.apache.org/jira/browse/MESOS-9147
[8] https://issues.apache.org/jira/browse/MESOS-9086
[9] https://issues.apache.org/jira/browse/MESOS-6765


[Performance WG] Meeting Reminder: August 15

2018-08-14 Thread Benjamin Mahler
Just a reminder that the meeting tomorrow will go ahead as planned.

Jie graciously agreed to discuss a container launch benchmark that he's
been working on; should be interesting!

There are a few more topics up for discussion, and feel free to add to the
agenda. I will send out notes afterwards as usual.

Ben


Re: [VOTE] Release Apache Mesos 1.4.2 (rc1)

2018-08-13 Thread Benjamin Mahler
+1 (binding)

make check passes on macOS 10.13.6 with Apple LLVM version 9.1.0
(clang-902.0.39.2).

Thanks Kapil!

On Wed, Aug 8, 2018 at 3:06 PM, Kapil Arya  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.4.2.
>
> 1.4.2 is a bug fix release. The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_
> plain;f=CHANGELOG;hb=1.4.2-rc1
>
> The candidate for Mesos 1.4.2 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz
>
> The tag to be voted on is 1.4.2-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.2-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/
> mesos-1.4.2.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/
> mesos-1.4.2.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1231
>
> Please vote on releasing this package as Apache Mesos 1.4.2!
>
> The vote is open until Sat Aug 11 11:59:59 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.4.2
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Anand and Kapil
>


Re: [VOTE] Release Apache Mesos 1.4.2 (rc1)

2018-08-13 Thread Benjamin Mahler
This was fixed in
https://github.com/apache/mesos/commit/02ad5c8cdd644ee8eec83bf887daa98bb163637d,
I don't recall there being any issues due to it.

On Mon, Aug 13, 2018 at 4:50 PM, Benjamin Mahler  wrote:

> Hm.. I ran make check on macOS saw the following:
>
> [ RUN  ] AwaitTest.AwaitSingleDiscard
> src/tests/collect_tests.cpp:275: Failure
> Value of: promise.future().hasDiscard()
>   Actual: false
> Expected: true
> [  FAILED  ] AwaitTest.AwaitSingleDiscard (0 ms)
>
> On Wed, Aug 8, 2018 at 3:06 PM, Kapil Arya  wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.4.2.
>>
>> 1.4.2 is a bug fix release. The CHANGELOG for the release is available at:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain
>> ;f=CHANGELOG;hb=1.4.2-rc1
>>
>> The candidate for Mesos 1.4.2 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz
>>
>> The tag to be voted on is 1.4.2-rc1:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.2-rc1
>>
>> The SHA512 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos
>> -1.4.2.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos
>> -1.4.2.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1231
>>
>> Please vote on releasing this package as Apache Mesos 1.4.2!
>>
>> The vote is open until Sat Aug 11 11:59:59 PDT 2018 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.4.2
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Anand and Kapil
>>
>
>


Re: [VOTE] Release Apache Mesos 1.4.2 (rc1)

2018-08-13 Thread Benjamin Mahler
Hm.. I ran make check on macOS saw the following:

[ RUN  ] AwaitTest.AwaitSingleDiscard
src/tests/collect_tests.cpp:275: Failure
Value of: promise.future().hasDiscard()
  Actual: false
Expected: true
[  FAILED  ] AwaitTest.AwaitSingleDiscard (0 ms)

On Wed, Aug 8, 2018 at 3:06 PM, Kapil Arya  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.4.2.
>
> 1.4.2 is a bug fix release. The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_
> plain;f=CHANGELOG;hb=1.4.2-rc1
>
> The candidate for Mesos 1.4.2 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz
>
> The tag to be voted on is 1.4.2-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.2-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/
> mesos-1.4.2.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/
> mesos-1.4.2.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1231
>
> Please vote on releasing this package as Apache Mesos 1.4.2!
>
> The vote is open until Sat Aug 11 11:59:59 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.4.2
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Anand and Kapil
>


Re: Using jemalloc as default allocator

2018-08-13 Thread Benjamin Mahler
I would be interested in knowing what other projects have done around this
(e.g. Rust, Redis seem to use it by default on Linux, and I see ongoing
discussion in other projects e.g. Ruby).

To James' point, while anyone technically can use jemalloc, I have only
seen 1 user doing it. Maybe there are more, but it certainly isn't
widespread. Sometimes our users don't even know about whether they're
running an optimized build, they just install a package and run it.

We should aim to do something better by default for users, if possible, no?

Also, james, which profiling support are you referring to in the context of
jemalloc? Are you saying that memory profiling works without it? Right now
as far as I know, we can't use memory profiling when trying to investigate
user issues because nearly no one is using jemalloc.

On Mon, Aug 13, 2018 at 3:48 AM Benno Evers  wrote:

> Ok then, let's not do it for now.
>
> Best regards,
>
> On Fri, Aug 10, 2018 at 6:10 PM, James Peach  wrote:
>
>>
>>
>> > On Aug 10, 2018, at 8:56 AM, Benno Evers  wrote:
>> >
>> > Hi guys,
>> >
>> > it's quite late in the release cycle, but I've been thinking about
>> > proposing to enable the `--enable-jemalloc-allocator` configuration
>> setting
>> > by default for linux builds of Mesos.
>> >
>> > The thinking is that
>> > - Benchmarks consistently show a small to medium performance improvement
>> > - The bundled jemalloc version (5.0.1) has been released as stable for
>> > over a year and has not seen any severe bugs
>> > - Our own Mesos builds with jemalloc don't show any issues so far
>> >
>> > What do you think?
>>
>> I don't think it's worth it. Anyone who wants to use jemalloc can already
>> do it, and the Mesos profiling support works nicely without also forcing a
>> build-time dependency. In general, I feel that bundling dependencies is a
>> burden on the build (e.g out bundled jemalloc is already out of date).
>>
>> J
>
>
>
>
> --
> Benno Evers
> Software Engineer, Mesosphere
>


Re: Backport Policy

2018-07-26 Thread Benjamin Mahler
uch cases to a
>>> release
>>> > >> manager in my opinion will help us enforce the strategy of minimal
>>> > number
>>> > >> backports. As a bonus, the release manager will have a much better
>>> > >> understanding of what's going on with the release, keyword: "more
>>> > >> ownership".
>>> > >>
>>> > >> On Sat, Jul 14, 2018 at 12:07 AM, Andrew Schwartzmeyer <
>>> > >> and...@schwartzmeyer.com> wrote:
>>> > >>
>>> > >>> I believe I fall somewhere between Alex and Ben.
>>> > >>>
>>> > >>> As for deciding what to backport or not, I lean toward Alex's view
>>> of
>>> > >>> backporting as little as possible (and agree with his criteria). My
>>> > >>> reasoning is that all changes can have unforeseen consequences,
>>> which I
>>> > >>> believe is something to be actively avoided in already released
>>> > versions.
>>> > >>> The reason for backporting patches to fix regressions is the same
>>> as
>>> > the
>>> > >>> reason to avoid backporting as much as possible: keep behavior
>>> > consistent
>>> > >>> (and safe) within a release. With that as the goal of a branch in
>>> > >>> maintenance mode, it makes sense to fix regressions, and make
>>> > exceptions to
>>> > >>> fix CVEs and other critical/blocking issues.
>>> > >>>
>>> > >>> As for who should decide what to backport, I lean toward Ben's
>>> view of
>>> > >>> the burden being on the committer. I don't think we should add more
>>> > work
>>> > >>> for release managers, and I think the committer/shepherd obviously
>>> has
>>> > the
>>> > >>> most understanding of the context around changes proposed for
>>> backport.
>>> > >>>
>>> > >>> Here's an example of a recent bugfix which I backported:
>>> > >>> https://reviews.apache.org/r/67587/ (for MESOS-3790)
>>> > >>>
>>> > >>> While normally I believe this change falls under "avoid due to
>>> > >>> unforeseen consequences," I made an exception as the bug was old,
>>> circa
>>> > >>> 2015, (indicating it had been an issue for others), and was causing
>>> > >>> recurring failures in testing. The fix itself was very small,
>>> meaning
>>> > it
>>> > >>> was easier to evaluate for possible side effects, so I felt a
>>> little
>>> > safer
>>> > >>> in that regard. The effect of not having the fix was a fatal and
>>> > undesired
>>> > >>> crash, which furthermore left troublesome side effects on the
>>> system
>>> > (you
>>> > >>> couldn't bring the agent back up). And lastly, a dependent project
>>> > (DC/OS)
>>> > >>> wanted it in their next bump, which necessitated backporting to the
>>> > release
>>> > >>> they were pulling in.
>>> > >>>
>>> > >>> I think in general we should backport only as necessary, and leave
>>> it
>>> > on
>>> > >>> the committers to decide if backporting a particular change is
>>> > necessary.
>>> > >>>
>>> > >>>
>>> > >>> On 07/13/2018 12:54 am, Alex Rukletsov wrote:
>>> > >>>
>>> > >>>> This is exactly where our views differ, Ben : )
>>> > >>>>
>>> > >>>> Ideally, I would like a release manager to have more ownership and
>>> > less
>>> > >>>> manual work. In my imagination, a release manager has more power
>>> and
>>> > >>>> control about dates, features, backports and everything that is
>>> > related
>>> > >>>> to
>>> > >>>> "their" branch. I would also like us to back port as little as
>>> > >>>> possible, to
>>> > >>>> simplify testing and releasing patch versions.
>>> > >>>>
>>> > >>>> On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler <
>>> bmah...@apache.org>

[Performance WG] Meeting Notes - July 18

2018-07-18 Thread Benjamin Mahler
For folks that missed it, here are my own notes. Thanks to alexr and dario
for presenting!

(1) I discussed a high agent cpu usage issue when hitting the /containers
endpoint:

https://issues.apache.org/jira/browse/MESOS-8418

This was resolved, but it didn't get attention for months until I noticed a
recent complaint about it in slack. It highlights the need to periodically
check for new performance tickets in the backlog.


(2) alexr presented slides on some ongoing work to improve the state
serving performance:

https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g

This included measurements from clusters with many frameworks. The short
term plan (hopefully in 1.7.0) is to investigate batching / parallel
processing of state requests (still on the master actor), and halving the
queueing time via authorizing outside of the master actor. There are
potential longer term plans, but these short term improvements should take
us pretty far, along with (3).


(3) I presented some results from adapting our jsonify library to use
rapidjson under the covers, and it cuts our state serving time in half:

https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo

The code is mainly done but there are a few things left to get it in a
reviewable state.
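
For those unfamiliar with rapidjson, the relevant piece is its SAX-style
Writer, which streams JSON straight into an output buffer without building
an intermediate object tree. A small illustration of that API (not the
jsonify internals):

#include <iostream>

#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

// Values are streamed directly into the output buffer as they are visited,
// with no intermediate DOM allocation.
int main()
{
  rapidjson::StringBuffer buffer;
  rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);

  writer.StartObject();
  writer.Key("hostname");
  writer.String("agent-1.example.com");
  writer.Key("cpus");
  writer.Double(8.0);
  writer.EndObject();

  std::cout << buffer.GetString() << std::endl;
  return 0;
}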


(4) I briefly mentioned some various other performance work:

  (a) Libprocess metrics scalability: Greg, Gilbert and I undertook some
benchmarking and improvements were made to better handle a large number of
metrics, in support of per-framework metrics:

https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets)

There's still more open work that can be done here, but a more critical
user-facing improvement at this point is the migration to push gauges in
the master and allocator:

https://issues.apache.org/jira/browse/MESOS-8914
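
Roughly, the difference is the following (an illustration only, not the
libprocess metrics API): a pull gauge is evaluated at snapshot time,
typically via a dispatch to the owning actor, while a push gauge is just a
value the owner updates inline as state changes.

#include <atomic>
#include <cstdint>
#include <functional>
#include <string>

// Illustration only -- not the libprocess metrics API.
struct PullGauge {
  std::string name;
  std::function<double()> evaluate;  // run on every metrics snapshot,
                                     // typically deferred to the owning actor
};

struct PushGauge {
  std::string name;
  std::atomic<int64_t> value{0};     // owner updates it inline as state
                                     // changes; a snapshot is just a load
};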

  (b) JSON parsing cost was cut in half by avoiding conversion through an
intermediate format and instead directly parsing into our data structures:

https://issues.apache.org/jira/browse/MESOS-9067


(5) Till, Kapil, Meng Zhu, Greg Mann, Gaston and I have been working on
benchmarking and making performance improvements to the allocator to speed
up allocation cycle time and to address "offer starvation". In our
multi-framework scale testing we saw allocation cycle time go down from 15
secs to 5 secs, and there's still lots of low hanging fruit:

https://issues.apache.org/jira/browse/MESOS-9087

For offer starvation, we fixed an offer fragmentation issue due to quota
"chopping" and we introduced the choice of a random weighted shuffle sorter
as an alternative to ensure that high share frameworks don't get starved.
We may also investigate introducing a round-robin sorter that shuffles
between rounds if needed:

https://issues.apache.org/jira/browse/MESOS-8935
https://issues.apache.org/jira/browse/MESOS-8936
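
As background on the weighted-shuffle idea (a generic sketch using the
Efraimidis-Spirakis scheme, not necessarily the sorter that landed): draw
one uniform random number per client and sort by u^(1/weight), so
higher-weight clients tend to sort earlier while every client still has a
chance to come first.

#include <algorithm>
#include <cmath>
#include <random>
#include <string>
#include <utility>
#include <vector>

struct Client {
  std::string name;
  double weight;  // must be > 0
};

// Weighted random shuffle: draw u ~ U(0,1) per client and sort descending
// by pow(u, 1.0 / weight). Higher-weight clients tend to come earlier, but
// every ordering has nonzero probability, so no client is starved.
std::vector<Client> weightedShuffle(std::vector<Client> clients, std::mt19937& rng)
{
  std::uniform_real_distribution<double> u(0.0, 1.0);

  std::vector<std::pair<double, Client>> keyed;
  for (const Client& client : clients) {
    keyed.push_back({std::pow(u(rng), 1.0 / client.weight), client});
  }

  std::sort(keyed.begin(), keyed.end(),
            [](const std::pair<double, Client>& a,
               const std::pair<double, Client>& b) {
              return a.first > b.first;  // descending key
            });

  std::vector<Client> result;
  for (const std::pair<double, Client>& entry : keyed) {
    result.push_back(entry.second);
  }
  return result;
}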


(6) Dario talked about the MPSC queue that was recently added to libprocess
for use in Process event queues. This needs to be enabled at configure-time
as is currently the case for the lock free structures, and should provide a
throughput improvement to libprocess. We still need to chart a path to
turning these libprocess performance enhancing features on by default.


(7) I can draft a 1.7.0 performance improvements blog post that features
all of these topics and more. We may need to pull out some of the more
lengthy content into separate blog posts if needed, but I think from the
user perspective, highlighting what they get in 1.7.0 performance wise will
be nice.

Agenda Doc:
https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU

Ben


Reminder: Performance Working Group Meeting July 18 10AM PST

2018-07-17 Thread Benjamin Mahler
Hi folks, just a reminder that there is indeed a performance working group
meeting tomorrow.

We'll discuss what's been going on recently in the performance area, and
there's a lot to discuss! I will send out some detailed notes to the
mailing lists afterwards.

Ben


Re: Backport Policy

2018-07-12 Thread Benjamin Mahler
+user, it would probably be good to hear from users as well.

Please see the original proposal as well as Alex's proposal and let us know
your thoughts.

To continue the discussion from where Alex left off:

> Other bugs and significant improvements, e.g., performance, may be back 
> ported,
the release manager should ideally be the one who decides on this.

I'm a little puzzled by this, why is the release manager involved? As we
already document, backports occur when the bug is fixed, so this happens in
the steady state of development, not at release time. The release manager
only comes in at the time of the release itself, at which point all
backports have already happened and the release manager handles the release
process. Only blocker level issues can stop the release and while the
release manager has a strong say, we should generally agree on what
consists of a release blocking issue.

Just to clarify my workflow, I generally backport every bug fix I commit
that applies cleanly, right after I commit it to master (with the
exceptions I listed below).

On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov  wrote:

> I would like to back port as little as possible. I suggest the following
> criteria:
>
> * By default, regressions are back ported to existing release branches. A
> bug is considered a regression if the functionality is present in the
> previous minor or patch version and is not affected by the bug there.
>
> * Critical and blocker issues, e.g., a CVE, can be back ported.
>
> * Other bugs and significant improvements, e.g., performance, may be back
> ported, the release manager should ideally be the one who decides on this.
>
> On Thu, Jul 12, 2018 at 12:25 AM, Vinod Kone  wrote:
>
> > Ben, thanks for the clarification. I'm in agreement with the points you
> > made.
> >
> > Once we have consensus, would you mind updating the doc?
> >
> > On Wed, Jul 11, 2018 at 5:15 PM Benjamin Mahler 
> > wrote:
> >
> > > I realized recently that we aren't all on the same page with
> backporting.
> > > We currently only document the following:
> > >
> > > "Typically the fix for an issue that is affecting supported releases
> > lands
> > > on the master branch and is then backported to the release branch(es).
> In
> > > rare cases, the fix might directly go into a release branch without
> > landing
> > > on master (e.g., fix / issue is not applicable to master)." [1]
> > >
> > > This leaves room for interpretation about what lies outside of
> "typical".
> > > Here's the simplest way I can explain what I stick to, and I'd like to
> > hear
> > > what others have in mind:
> > >
> > > * By default, bug fixes at any level should be backported to existing
> > > release branches if it affects those releases. Especially important:
> > > crashes, bugs in non-experimental features.
> > >
> > > * Exceptional cases that can omit backporting: difficult to backport
> > fixes
> > > (especially if the bugs are deemed of low priority), bugs in
> experimental
> > > features.
> > >
> > > * Exceptional non-bug cases that can be backported: performance
> > > improvements.
> > >
> > > I realize that there is a ton of subtlety here (even in terms of which
> > > things are defined as bugs). But I hope we can lay down a policy that
> > gives
> > > everyone the right mindset for common cases and then discuss corner
> cases
> > > on-demand in the future.
> > >
> > > [1] http://mesos.apache.org/documentation/latest/versioning/
> > >
> >
>


Backport Policy

2018-07-11 Thread Benjamin Mahler
I realized recently that we aren't all on the same page with backporting.
We currently only document the following:

"Typically the fix for an issue that is affecting supported releases lands
on the master branch and is then backported to the release branch(es). In
rare cases, the fix might directly go into a release branch without landing
on master (e.g., fix / issue is not applicable to master)." [1]

This leaves room for interpretation about what lies outside of "typical".
Here's the simplest way I can explain what I stick to, and I'd like to hear
what others have in mind:

* By default, bug fixes at any level should be backported to existing
release branches if it affects those releases. Especially important:
crashes, bugs in non-experimental features.

* Exceptional cases that can omit backporting: difficult to backport fixes
(especially if the bugs are deemed of low priority), bugs in experimental
features.

* Exceptional non-bug cases that can be backported: performance
improvements.

I realize that there is a ton of subtlety here (even in terms of which
things are defined as bugs). But I hope we can lay down a policy that gives
everyone the right mindset for common cases and then discuss corner cases
on-demand in the future.

[1] http://mesos.apache.org/documentation/latest/versioning/


Re: Normalization of metric keys

2018-07-06 Thread Benjamin Mahler
Do we also want:

3. Has an unambiguous decoding.

Replacing '/' with '#%$' means I don't know if the user actually supplied
'#%$' or '/'. But using something like percent-encoding would have property
3.
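
For example, percent-encoding just the characters we reserve in metric keys
keeps keys splittable on '/' while remaining unambiguous to decode (a
sketch, not a proposed implementation):

#include <cstdio>
#include <string>

// Sketch: percent-encode the characters reserved in metric keys ('/' and
// '%'). Because '%' itself is escaped, decoding is unambiguous: an encoded
// component can always be restored to exactly what the user supplied.
std::string encodeComponent(const std::string& input)
{
  std::string output;
  for (unsigned char c : input) {
    if (c == '/' || c == '%') {
      char buffer[4];
      std::snprintf(buffer, sizeof(buffer), "%%%02X", c);
      output += buffer;
    } else {
      output += static_cast<char>(c);
    }
  }
  return output;
}

// encodeComponent("my/framework") == "my%2Fframework"
// encodeComponent("50%_done")     == "50%25_done"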

On Fri, Jul 6, 2018 at 10:25 AM, Greg Mann  wrote:

> Thanks for the reply Ben!
>
> Yea I suspect the lack of normalization there was not intentional, and it
> means that you can no longer reliably split on '/' unless you apply some
> external controls to user input. Yep, this is bad :)
>
> One thing we should consider when normalizing metadata embedded in metric
> keys (like framework name/ID) is that operators will likely want to
> de-normalize this information in their metrics tooling. For example,
> ideally something like the 'mesos_exporter' [1] could expose the framework
> name/ID as tags which could be easily consumed by the cluster's metrics
> infrastructure.
>
> To accommodate de-normalization, any substitutions we perform while
> normalizing should be:
>
>1. Unique - we should substitute a single, unique string for each
>disallowed character
>2. Verbose - we should substitute strings which are unlikely to appear
>in user input. (Examples: 

Re: [Proposal] Replicated log storage compaction

2018-07-06 Thread Benjamin Mahler
I was chatting with Ilya on slack and I'll re-post here:

* Like Jie, I was hoping for a toggle (maybe it should start default off
until we have production experience? It sounds like Ilya already has
experience with it running in test clusters so far.)

* I was asking whether this would be considered a flaw in leveldb's
compaction algorithm. Ilya didn't see any changes in recent leveldb
releases that would affect this. So, we probably should file an issue to
see if they think it's a flaw and whether our workaround makes sense to
them. We can reference this in the code for posterity.
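
For reference, the leveldb call in question is DB::CompactRange(); a sketch
of forcing compaction over the truncated positions (the key encoding here is
illustrative, not the replicated log's actual encoding):

#include <cstdint>
#include <string>

#include <leveldb/db.h>

// Sketch: after truncating replicated log positions [first, upTo), force
// leveldb to compact that key range so the deleted entries are reclaimed
// on disk.
void compactTruncated(leveldb::DB* db, uint64_t first, uint64_t upTo)
{
  const std::string begin = std::to_string(first);
  const std::string end = std::to_string(upTo);

  const leveldb::Slice beginSlice(begin);
  const leveldb::Slice endSlice(end);

  // Passing nullptr for either argument would mean "compact everything";
  // restricting to the truncated range keeps the pause proportional to
  // what was actually deleted.
  db->CompactRange(&beginSlice, &endSlice);
}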

On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu  wrote:

> Sounds good to me.
>
> My only ask is to have a way to turn this feature off (flag, env var, etc)
>
> - Jie
>
> On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone  wrote:
>
>> I don't know about the replicated log, but the proposal seems find to me.
>>
>> Jie/BenM, do you guys have an opinion?
>>
>> On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <
>> sshanmug...@twitter.com.invalid> wrote:
>>
>>> +1. Aurora will hugely benefit from this change.
>>>
>>> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin 
>>> wrote:
>>>
>>> > Hi everyone,
>>> >
>>> > I'd like to propose adding "manual" LevelDB compaction to the
>>> > replicated log truncation process.
>>> >
>>> > Motivation
>>> >
>>> > Mesos Master and Aurora Scheduler use the replicated log to persist
>>> > information about the cluster. This log is periodically truncated to
>>> > prune outdated log entries. However the replicated log storage is not
>>> > compacted and grows without bounds. This leads to problems like
>>> > synchronous failover of all master/scheduler replicas happening
>>> > because all of them ran out of disk space.
>>> >
>>> > The only time when log storage compaction happens is during recovery.
>>> > Because of that periodic failovers are required to control the
>>> > replicated log storage growth. But this solution is suboptimal.
>>> > Failovers are not instant: e.g. Aurora Scheduler needs to recover the
>>> > storage which depending on the cluster can take several minutes.
>>> > During the downtime tasks cannot be (re-)scheduled and users cannot
>>> > interact with the service.
>>> >
>>> > Proposal
>>> >
>>> > In MESOS-184 John Sirois pointed out that our usage pattern doesn’t
>>> > work well with LevelDB background compaction algorithm. Fortunately,
>>> > LevelDB provides a way to force compaction with DB::CompactRange()
>>> > method. Replicated log storage can trigger it after persisting learned
>>> > TRUNCATE action and deleting truncated log positions. The compacted
>>> > range will be from previous first position of the log to the new first
>>> > position (the one the log was truncated up to).
>>> >
>>> > Performance impact
>>> >
>>> > Mesos Master and Aurora Scheduler have 2 different replicated log
>>> > usage profiles. For Mesos Master every registry update (agent
>>> > (re-)registration/marking, maintenance schedule update, etc.) induces
>>> > writing a complete snapshot which depending on the cluster size can
>>> > get pretty big (in a scale test fake cluster with 55k agents it is
>>> > ~15MB). Every snapshot is followed by a truncation of all previous
>>> > entries, which doesn't block the registrar and happens kind of in the
>>> > background. In the scale test cluster with 55k agents compactions
>>> > after such truncations take ~680ms.
>>> >
>>> > To reduce the performance impact for the Master compaction can be
>>> > triggered only after more than some configurable number of keys were
>>> > deleted.
>>> >
>>> > Aurora Scheduler writes incremental changes of its storage to the
>>> > replicated log. Every hour a storage snapshot is created and persisted
>>> > to the log, followed by a truncation of all entries preceding the
>>> > snapshot. Therefore, storage compactions will be infrequent but will
>>> > deal with potentially large number of keys. In the scale test cluster
>>> > such compactions took ~425ms each.
>>> >
>>> > Please let me know what you think about it.
>>> >
>>> > Thanks!
>>> >
>>> > --
>>> > Ilya Pronin
>>> >
>>>
>>
>


Re: [mesos-mail] Re: [Performance WG] Notes from meeting today

2018-07-03 Thread Benjamin Mahler
I just pushed some initial documentation for this, it will show up soon
next to the memory profiling link:

http://mesos.apache.org/documentation/latest/#administration

On Fri, May 25, 2018 at 6:13 PM, Benjamin Mahler  wrote:

> I'll write up some instructions with what I know so far and get it added
> to the website. In the meantime, here's what you need to do to generate a
> 60 second profile:
>
> $ sudo perf record -F 100 -a -g --call-graph dwarf -p <mesos-master-pid> -- sleep 60
> $ sudo perf script --header | c++filt > mesos-master.stacks
> $ gzip mesos-master.stacks
> # Share the mesos-master.stacks.gz file for analysis.
>
> It seems that frame pointer omission is ok, as long as '--call-graph
> dwarf' is provided to perf. I don't yet know if frame pointers yield better
> traces than '--call-graph dwarf' without frame pointers.
>
> If you want to use flamescope yourself, follow the instructions here and
> put the unzipped file above into the 'examples' directory:
> https://github.com/Netflix/flamescope
>
> On Thu, May 17, 2018 at 4:51 PM, Zhitao Li  wrote:
>
>> Hi Ben,
>>
>> Thanks a lot, this is super informative.
>>
>> One question: will you write a blog/doc on how to generate flamescope
>> graphs from either a micro-benchmark, or a real cluster? Also, do you know
>> what configuration for compiling should be used to preserve proper debug
>> symbols for both Mesos and 3rdparty libraries?
>>
>> On Wed, May 16, 2018 at 5:44 PM, Benjamin Mahler 
>> wrote:
>>
>> > +Judith
>> >
>> > There should be a recording. Judith, do you know where they get posted?
>> >
>> > Benjamin, glad to hear it's useful, I'll continue doing it!
>> >
>> > On Wed, May 16, 2018 at 4:41 PM Gilbert Song 
>> > wrote:
>> >
>> > > Do we have the recorded video for this meeting?
>> > >
>> > > On Wed, May 16, 2018 at 1:54 PM, Benjamin Bannier <
>> > > benjamin.bann...@mesosphere.io> wrote:
>> > >
>> > > > Hi Ben,
>> > > >
>> > > > thanks for taking the time to edit and share these detailed notes.
>> > Being
>> > > > able to asynchronously see the great work folks are doing surfaced
>> is
>> > > > great, especially when put into context with thought like here.
>> > > >
>> > > >
>> > > > Benjamin
>> > > >
>> > > > > On May 16, 2018, at 8:06 PM, Benjamin Mahler 
>> > > wrote:
>> > > > >
>> > > > > Hi folks,
>> > > > >
>> > > > > Here are some notes from the performance meeting today.
>> > > > >
>> > > > > (1) First I did a demo of flamescope, you can find it here:
>> > > > > https://github.com/Netflix/flamescope
>> > > > >
>> > > > > It's a very useful tool, hopefully we can make it easier for
>> users to
>> > > > > generate the data that we can drop into flamescope when reporting
>> any
>> > > > > performance issues. One of the open questions is how `perf
>> > --call-graph
>> > > > > dwarf` compares to `perf -g` but with mesos compiled with frame
>> > > > pointers. I
>> > > > > haven't had time to check this yet.
>> > > > >
>> > > > > When playing with the tool, it was easy to find some hot spots in
>> the
>> > > > given
>> > > > > cluster I was looking at (which was not necessarily
>> representative).
>> > > For
>> > > > > the agent, jie filed:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/MESOS-8901
>> > > > >
>> > > > > And for the master, I noticed that metrics, state json generation
>> (no
>> > > > > surprise), and a particular spot in the allocator were very
>> > expensive.
>> > > > >
>> > > > > Metrics we'd like to address via migration to push gauges (Zhitao
>> has
>> > > > > offered to help with this effort):
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/MESOS-8914
>> > > > >
>> > > > > The state generation we'd like to address via streaming state
>> into a
>> > > > > separate actor (and providing filtering as well), this will get
>> > further
>> > > > > investigated / prioritized very soon:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/MESOS-8345
>> > > > >
>> > > > > (2) Kapil discussed benchmarks for the long standing "offer
>> > starvation"
>> > > > > issue:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/MESOS-3202
>> > > > >
>> > > > > I'll send out an email or document soon with some background on
>> this
>> > > > issue
>> > > > > as well as our options to address it.
>> > > > >
>> > > > > Let me know if you have any questions or feedback!
>> > > > >
>> > > > > Ben
>> > > >
>> > > > --
>> > > > You received this message because you are subscribed to the Google
>> > Groups
>> > > > "Apache Mesos Mail Lists" group.
>> > > > Visit this group at
>> > > https://groups.google.com/a/mesosphere.io/group/mesos-
>> > > > mail/.
>> > > > For more options, visit
>> > > https://groups.google.com/a/mesosphere.io/d/optout
>> > > > .
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Cheers,
>>
>> Zhitao Li
>>
>
>


Re: Normalization of metric keys

2018-07-03 Thread Benjamin Mahler
I don't think the lack of principal normalization was intentional. Why
spread that further? Don't we also have some normalization today?

Having slashes show up in components complicates parsing (can no longer
split on '/'), no? For example, if we were to introduce the ability to
query a subset of metrics with a simple matcher (e.g.
/frameworks/*/messages_received), then this would be complicated by the
presence of slashes in the principal or other user supplied strings.
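
As a small illustration (plain C++ with a hypothetical key), once a user-supplied component such as the principal contains a slash, splitting the key on '/' no longer recovers the intended components, so a matcher like /frameworks/*/messages_received cannot tell where the principal ends:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Naive split on '/', as a metrics matching or querying tool might do.
std::vector<std::string> splitKey(const std::string& key)
{
  std::vector<std::string> components;
  std::stringstream stream(key);
  std::string component;
  while (std::getline(stream, component, '/')) {
    components.push_back(component);
  }
  return components;
}

int main()
{
  // A principal of "team/service" yields 4 components instead of the
  // expected 3, making "frameworks/<principal>/messages_received" ambiguous.
  for (const std::string& c : splitKey("frameworks/team/service/messages_received")) {
    std::cout << c << std::endl;
  }
  return 0;
}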

On Tue, Jul 3, 2018 at 3:17 PM, Greg Mann  wrote:

> Hi all!
> I'm currently working on adding a suite of new per-framework metrics to
> help schedulers better debug unexpected/unwanted behavior (MESOS-8842
> ). One issue that has
> come up during this work is how we should handle strings like the framework
> name or role name in metric keys, since those strings may contain
> characters like '/' which already have a meaning in our metrics interface.
> I intend to place the framework name and ID in the keys for the new
> per-framework metrics, delimited by a sufficiently-unique separator so that
> operators can decode the name/ID in their metrics tooling. An example
> per-framework metric key:
>
> master/frameworks/<framework name>###<framework ID>/tasks/task_running
>
>
> I recently realized that we actually already allow the '/' character in
> metric keys, since we include the framework principal in these keys:
>
> frameworks/<principal>/messages_received
> frameworks/<principal>/messages_processed
>
> We don't disallow any characters in the principal, so anything could
> appear in those keys.
>
> *Since we don't normalize the principal in the above keys, my proposal is
> that we do not normalize the framework name at all when constructing the
> new per-framework metric keys.*
>
>
> Let me know what you think!
>
> Cheers,
> Greg
>


Re: CHECK_NOTNULL(self->bev) Check failed inside LibeventSSLSocketImpl::shutdown

2018-06-27 Thread Benjamin Mahler
Can you also include the stack trace from the CHECK failure?

On Tue, Jun 26, 2018 at 11:25 PM, Suteng  wrote:

> F0622 11:22:30.985245 16127 libevent_ssl_socket.cpp:190] Check failed:
> 'self->bev' Must be non NULL
>
> Try LibeventSSLSocketImpl::shutdown(int how)
>
> CHECK_NOTNULL(self->bev)
>
>
>
> Test case:
>
> A server is non-ssl, B server is enable downgrade, B frequent link
> reconnect to A, then will generate this error, very low probability. It’s
> looks like bev is already free, than call shutdown again.
>
>
>
> class OpensslProcess : public ProtobufProcess
>
> {
>
> public:
>
>   OpensslProcess()
>
> : ProcessBase("OpensslProcess"), sendCnt(0), recvCnt(0) {}
>
>
>
>   ~OpensslProcess() {}
>
>
>
>   virtual void initialize()
>
>   {
>
>   install(::HandleMessage);
>
>   install("ping", ::pong);
>
>   //SendMessage();
>
>   }
>
>   void SendMessage()
>
>   {
>
> string data = "hello world";
>
> UdpMessage msg;
>
> msg.set_size(data.size());
>
> msg.set_data(data);
>
>
>
> Address serverAddr = Address(net::IP::parse("127.0.0.1",
> AF_INET).get(), 7012);
>
> UPID destUpid = UPID("OpensslProcess", serverAddr);
>
>
>
> send(destUpid, msg);
>
> sleep(5);
>
> link(destUpid, RemoteConnection::RECONNECT);
>
> send(destUpid, msg);
>
> link(destUpid, RemoteConnection::RECONNECT);
>
> send(destUpid, msg);
>
>   }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Su Teng  00241668
>
>
>
> Distributed and Parallel Software Lab
>
> Huawei Technologies Co., Ltd.
>
> Email:sut...@huawei.com
>
>
>
>
>


[Performance WG] June meeting canceled

2018-05-31 Thread Benjamin Mahler
Hi folks,

I will be out for most of June on vacation so I'm canceling this month's
performance working group meeting.

I was planning to bring up the kubernetes' kubelet benchmark:
https://kubernetes.io/blog/2018/05/24/kubernetes-containerd-integration-goes-ga/

Right now we don't have any agent-level container launching throughout /
scalability / overhead benchmarks so this would be very useful! It probably
will also identify some low hanging fruit on the agent. Let me know if
anyone is interested in doing this, I would be happy to help.

Ben


Re: [VOTE] Release Apache Mesos 1.3.3 (rc1)

2018-05-29 Thread Benjamin Mahler
+1 (binding)

Make check passes on macOS 10.13.4 with Apple LLVM version 9.1.0
(clang-902.0.39.1).

On Wed, May 23, 2018 at 10:00 PM, Michael Park  wrote:

> The tarball has been fixed, please vote now!
>
> 'twas BSD `tar` issues... :(
>
> Thanks,
>
> MPark
>
> On Wed, May 23, 2018 at 11:39 AM, Michael Park  wrote:
>
>> Huh... Super weird. I'll look into it.
>>
>> Thanks for checking!
>>
>> MPark
>>
>> On Wed, May 23, 2018 at 11:34 AM Vinod Kone  wrote:
>>
>>> It's empty for me too!
>>>
>>> On Wed, May 23, 2018 at 11:32 AM, Benjamin Mahler 
>>> wrote:
>>>
>>>> Thanks Michael!
>>>>
>>>> Looks like the tar.gz is empty, is it just me?
>>>>
>>>> On Tue, May 22, 2018 at 10:09 PM, Michael Park 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Please vote on releasing the following candidate as Apache Mesos 1.3.3.
>>>>>
>>>>> The CHANGELOG for the release is available at:
>>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_p
>>>>> lain;f=CHANGELOG;hb=1.3.3-rc1
>>>>> 
>>>>> 
>>>>>
>>>>> The candidate for Mesos 1.3.3 release is available at:
>>>>> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/mesos
>>>>> -1.3.3.tar.gz
>>>>>
>>>>> The tag to be voted on is 1.3.3-rc1:
>>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit
>>>>> ;h=1.3.3-rc1
>>>>>
>>>>> The SHA512 checksum of the tarball can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/mesos
>>>>> -1.3.3.tar.gz.sha512
>>>>>
>>>>> The signature of the tarball can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/mesos
>>>>> -1.3.3.tar.gz.asc
>>>>>
>>>>> The PGP key used to sign the release is here:
>>>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>>>
>>>>> The JAR is up in Maven in a staging repository here:
>>>>> https://repository.apache.org/content/repositories/orgapachemesos-1226
>>>>>
>>>>> Please vote on releasing this package as Apache Mesos 1.3.3!
>>>>>
>>>>> The vote is open until Fri May 25 22:07:39 PDT 2018 and passes if a
>>>>> majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Mesos 1.3.3
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> Thanks,
>>>>>
>>>>> MPark
>>>>>
>>>>
>>>>
>>>
>


Re: [mesos-mail] Re: [Performance WG] Notes from meeting today

2018-05-25 Thread Benjamin Mahler
I'll write up some instructions with what I know so far and get it added to
the website. In the meantime, here's what you need to do to generate a 60
second profile:

$ sudo perf record -F 100 -a -g --call-graph dwarf -p <pid> -- sleep 60
$ sudo perf script --header | c++filt > mesos-master.stacks
$ gzip mesos-master.stacks
# Share the mesos-master.stacks.gz file for analysis.

It seems that frame pointer omission is ok, as long as '--call-graph dwarf'
is provided to perf. I don't yet know if frame pointers yield better traces
than '--call-graph dwarf' without frame pointers.

If you want to use flamescope yourself, follow the instructions here and
put the unzipped file above into the 'examples' directory:
https://github.com/Netflix/flamescope

On Thu, May 17, 2018 at 4:51 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi Ben,
>
> Thanks a lot, this is super informative.
>
> One question: will you write a blog/doc on how to generate flamescope
> graphs from either a micro-benchmark, or a real cluster? Also, do you know
> what configuration for compiling should be used to preserve proper debug
> symbols for both Mesos and 3rdparty libraries?
>
> On Wed, May 16, 2018 at 5:44 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > +Judith
> >
> > There should be a recording. Judith, do you know where they get posted?
> >
> > Benjamin, glad to hear it's useful, I'll continue doing it!
> >
> > On Wed, May 16, 2018 at 4:41 PM Gilbert Song <gilb...@mesosphere.io>
> > wrote:
> >
> > > Do we have the recorded video for this meeting?
> > >
> > > On Wed, May 16, 2018 at 1:54 PM, Benjamin Bannier <
> > > benjamin.bann...@mesosphere.io> wrote:
> > >
> > > > Hi Ben,
> > > >
> > > > thanks for taking the time to edit and share these detailed notes.
> > Being
> > > > able to asynchronously see the great work folks are doing surfaced is
> > > > great, especially when put into context with thought like here.
> > > >
> > > >
> > > > Benjamin
> > > >
> > > > > On May 16, 2018, at 8:06 PM, Benjamin Mahler <bmah...@apache.org>
> > > wrote:
> > > > >
> > > > > Hi folks,
> > > > >
> > > > > Here are some notes from the performance meeting today.
> > > > >
> > > > > (1) First I did a demo of flamescope, you can find it here:
> > > > > https://github.com/Netflix/flamescope
> > > > >
> > > > > It's a very useful tool, hopefully we can make it easier for users
> to
> > > > > generate the data that we can drop into flamescope when reporting
> any
> > > > > performance issues. One of the open questions is how `perf
> > --call-graph
> > > > > dwarf` compares to `perf -g` but with mesos compiled with frame
> > > > pointers. I
> > > > > haven't had time to check this yet.
> > > > >
> > > > > When playing with the tool, it was easy to find some hot spots in
> the
> > > > given
> > > > > cluster I was looking at (which was not necessarily
> representative).
> > > For
> > > > > the agent, jie filed:
> > > > >
> > > > > https://issues.apache.org/jira/browse/MESOS-8901
> > > > >
> > > > > And for the master, I noticed that metrics, state json generation
> (no
> > > > > surprise), and a particular spot in the allocator were very
> > expensive.
> > > > >
> > > > > Metrics we'd like to address via migration to push gauges (Zhitao
> has
> > > > > offered to help with this effort):
> > > > >
> > > > > https://issues.apache.org/jira/browse/MESOS-8914
> > > > >
> > > > > The state generation we'd like to address via streaming state into
> a
> > > > > separate actor (and providing filtering as well), this will get
> > further
> > > > > investigated / prioritized very soon:
> > > > >
> > > > > https://issues.apache.org/jira/browse/MESOS-8345
> > > > >
> > > > > (2) Kapil discussed benchmarks for the long standing "offer
> > starvation"
> > > > > issue:
> > > > >
> > > > > https://issues.apache.org/jira/browse/MESOS-3202
> > > > >
> > > > > I'll send out an email or document soon with some background on
> this
> > > > issue
> > > > > as well as our options to address it.
> > > > >
> > > > > Let me know if you have any questions or feedback!
> > > > >
> > > > > Ben
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google
> > Groups
> > > > "Apache Mesos Mail Lists" group.
> > > > Visit this group at
> > > https://groups.google.com/a/mesosphere.io/group/mesos-
> > > > mail/.
> > > > For more options, visit
> > > https://groups.google.com/a/mesosphere.io/d/optout
> > > > .
> > > >
> > >
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>


Re: [VOTE] Release Apache Mesos 1.5.1 (rc1)

2018-05-23 Thread Benjamin Mahler
+1 (binding)

make check passes on macOS 10.13.4 with Apple LLVM version 9.1.0
(clang-902.0.39.1)

On Fri, May 11, 2018 at 12:35 PM, Gilbert Song  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.5.1.
>
> 1.5.1 includes the following:
> 
> 
> * [MESOS-1720] - Slave should send exited executor message when the
> executor is never launched.
> * [MESOS-7742] - Race conditions in IOSwitchboard: listening on unix socket
> and premature closing of the connection.
> * [MESOS-8125] - Agent should properly handle recovering an executor when
> its pid is reused.
> * [MESOS-8411] - Killing a queued task can lead to the command executor
> never terminating.
> * [MESOS-8416] - CHECK failure if trying to recover nested containers but
> the framework checkpointing is not enabled.
> * [MESOS-8468] - `LAUNCH_GROUP` failure tears down the default executor.
> * [MESOS-8488] - Docker bug can cause unkillable tasks.
> * [MESOS-8510] - URI disk profile adaptor does not consider plugin type for
> a profile.
> * [MESOS-8536] - Pending offer operations on resource provider resources
> not properly accounted for in allocator.
> * [MESOS-8550] - Bug in `Master::detected()` leads to coredump in
> `MasterZooKeeperTest.MasterInfoAddress`.
> * [MESOS-8552] - CGROUPS_ROOT_PidNamespaceForward and
> CGROUPS_ROOT_PidNamespaceBackward tests fail.
> * [MESOS-8565] - Persistent volumes are not visible in Mesos UI when
> launching a pod using default executor.
> * [MESOS-8569] - Allow newline characters when decoding base64 strings in
> stout.
> * [MESOS-8574] - Docker executor makes no progress when 'docker inspect'
> hangs.
> * [MESOS-8575] - Improve discard handling for 'Docker::stop' and
> 'Docker::pull'.
> * [MESOS-8576] - Improve discard handling of 'Docker::inspect()'.
> * [MESOS-8577] - Destroy nested container if
> `LAUNCH_NESTED_CONTAINER_SESSION` fails.
> * [MESOS-8594] - Mesos master stack overflow in libprocess socket send
> loop.
> * [MESOS-8598] - Allow empty resource provider selector in
> `UriDiskProfileAdaptor`.
> * [MESOS-8601] - Master crashes during slave reregistration after failover.
> * [MESOS-8604] - Quota headroom tracking may be incorrect in the presence
> of hierarchical reservation.
> * [MESOS-8605] - Terminal task status update will not send if 'docker
> inspect' is hung.
> * [MESOS-8619] - Docker on Windows uses `USERPROFILE` instead of `HOME` for
> credentials.
> * [MESOS-8624] - Valid tasks may be explicitly dropped by agent due to race
> conditions.
> * [MESOS-8631] - Agent should be able to start a task with every CPU on a
> Windows machine.
> * [MESOS-8641] - Event stream could send heartbeat before subscribed.
> * [MESOS-8646] - Agent should be able to resolve file names on open files.
> * [MESOS-8651] - Potential memory leaks in the `volume/sandbox_path`
> isolator.
> * [MESOS-8741] - `Add` to sequence will not run if it races with sequence
> destruction.
> * [MESOS-8742] - Agent resource provider config API calls should be
> idempotent.
> * [MESOS-8786] - CgroupIsolatorProcess accesses subsystem processes
> directly.
> * [MESOS-8787] - RP-related API should be experimental.
> * [MESOS-8876] - Normal exit of Docker container using rexray volume
> results in TASK_FAILED.
> * [MESOS-8881] - Enable epoll backend in libevent integration.
> * [MESOS-8885] - Disable libevent debug mode.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_
> plain;f=CHANGELOG;hb=1.5.1-rc1
> 
> 
>
> The candidate for Mesos 1.5.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/mesos-1.5.1.tar.gz
>
> The tag to be voted on is 1.5.1-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.1-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/
> mesos-1.5.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/
> mesos-1.5.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1224
>
> Please vote on releasing this package as Apache Mesos 1.5.1!
>
> The vote is open until Wed May 16 12:31:02 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.5.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Gilbert
>


Re: [VOTE] Release Apache Mesos 1.3.3 (rc1)

2018-05-23 Thread Benjamin Mahler
Thanks Michael!

Looks like the tar.gz is empty, is it just me?

On Tue, May 22, 2018 at 10:09 PM, Michael Park  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.3.3.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_
> plain;f=CHANGELOG;hb=1.3.3-rc1
> 
> 
>
> The candidate for Mesos 1.3.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/mesos-1.3.3.tar.gz
>
> The tag to be voted on is 1.3.3-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.3.3-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/
> mesos-1.3.3.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.3-rc1/
> mesos-1.3.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1226
>
> Please vote on releasing this package as Apache Mesos 1.3.3!
>
> The vote is open until Fri May 25 22:07:39 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.3.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> MPark
>


High Level Design Doc: Offer Starvation

2018-05-18 Thread Benjamin Mahler
Hi folks,

One of the long standing issues with running many frameworks on Mesos is
the presence of what is called "offer starvation". This is when some
role/framework that has unsatisfied demand is not receiving offers, while
mesos is continually sends offers to other roles/frameworks that don't want
them. This was originally captured via:

https://issues.apache.org/jira/browse/MESOS-3202

It's currently not possible to program a well-behaved scheduler to avoid
this issue, since the only mechanisms schedulers have today is to SUPPRESS
if they have no work to do and otherwise filter offers that aren't needed
for a timeout. However, a scheduler that has short lived workloads must
REVIVE frequently (which clears all of its filters). With a sufficient
number of these frameworks Mesos may not be able to allocate all the
available resources. See the Background section of the document for more
details.
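
For reference, a minimal sketch of the only tools a scheduler has today, built from the v1 scheduler Call protobufs (field names assumed from scheduler.proto); REVIVE is what clears all previously set filters:

#include <mesos/v1/mesos.hpp>
#include <mesos/v1/scheduler/scheduler.hpp>

using mesos::v1::FrameworkID;
using mesos::v1::scheduler::Call;

// Sent when the scheduler has no work: stop receiving offers entirely.
Call suppress(const FrameworkID& frameworkId)
{
  Call call;
  call.mutable_framework_id()->CopyFrom(frameworkId);
  call.set_type(Call::SUPPRESS);
  return call;
}

// Sent when new (possibly short lived) work arrives: this clears *all*
// filters previously set via declines, which is what leads to the
// starvation behavior described above when many frameworks do it often.
Call revive(const FrameworkID& frameworkId)
{
  Call call;
  call.mutable_framework_id()->CopyFrom(frameworkId);
  call.set_type(Call::REVIVE);
  return call;
}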

This document goes over the background of the issue and covers various
solutions for addressing it. Some of them are longer term and would merit
their own design doc:

https://docs.google.com/document/d/1uvTmBo_21Ul9U_mijgWyh7hE0E_yZXrFr43JIB9OCl8

The current thinking is that it would be simplest in the short term to
provide an alternative sorter to DRF that can be chosen when starting the
master (e.g. random). In the medium term, we may add demand-awareness, and
long term migrate to shared state scheduling.

Please share any feedback or questions, thanks!

Ben


Re: [mesos-mail] Re: [Performance WG] Notes from meeting today

2018-05-16 Thread Benjamin Mahler
+Judith

There should be a recording. Judith, do you know where they get posted?

Benjamin, glad to hear it's useful, I'll continue doing it!

On Wed, May 16, 2018 at 4:41 PM Gilbert Song <gilb...@mesosphere.io> wrote:

> Do we have the recorded video for this meeting?
>
> On Wed, May 16, 2018 at 1:54 PM, Benjamin Bannier <
> benjamin.bann...@mesosphere.io> wrote:
>
> > Hi Ben,
> >
> > thanks for taking the time to edit and share these detailed notes. Being
> > able to asynchronously see the great work folks are doing surfaced is
> > great, especially when put into context with thought like here.
> >
> >
> > Benjamin
> >
> > > On May 16, 2018, at 8:06 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
> > >
> > > Hi folks,
> > >
> > > Here are some notes from the performance meeting today.
> > >
> > > (1) First I did a demo of flamescope, you can find it here:
> > > https://github.com/Netflix/flamescope
> > >
> > > It's a very useful tool, hopefully we can make it easier for users to
> > > generate the data that we can drop into flamescope when reporting any
> > > performance issues. One of the open questions is how `perf --call-graph
> > > dwarf` compares to `perf -g` but with mesos compiled with frame
> > pointers. I
> > > haven't had time to check this yet.
> > >
> > > When playing with the tool, it was easy to find some hot spots in the
> > given
> > > cluster I was looking at (which was not necessarily representative).
> For
> > > the agent, jie filed:
> > >
> > > https://issues.apache.org/jira/browse/MESOS-8901
> > >
> > > And for the master, I noticed that metrics, state json generation (no
> > > surprise), and a particular spot in the allocator were very expensive.
> > >
> > > Metrics we'd like to address via migration to push gauges (Zhitao has
> > > offered to help with this effort):
> > >
> > > https://issues.apache.org/jira/browse/MESOS-8914
> > >
> > > The state generation we'd like to address via streaming state into a
> > > separate actor (and providing filtering as well), this will get further
> > > investigated / prioritized very soon:
> > >
> > > https://issues.apache.org/jira/browse/MESOS-8345
> > >
> > > (2) Kapil discussed benchmarks for the long standing "offer starvation"
> > > issue:
> > >
> > > https://issues.apache.org/jira/browse/MESOS-3202
> > >
> > > I'll send out an email or document soon with some background on this
> > issue
> > > as well as our options to address it.
> > >
> > > Let me know if you have any questions or feedback!
> > >
> > > Ben
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Apache Mesos Mail Lists" group.
> > Visit this group at
> https://groups.google.com/a/mesosphere.io/group/mesos-
> > mail/.
> > For more options, visit
> https://groups.google.com/a/mesosphere.io/d/optout
> > .
> >
>


[Performance WG] Notes from meeting today

2018-05-16 Thread Benjamin Mahler
Hi folks,

Here are some notes from the performance meeting today.

(1) First I did a demo of flamescope, you can find it here:
https://github.com/Netflix/flamescope

It's a very useful tool, hopefully we can make it easier for users to
generate the data that we can drop into flamescope when reporting any
performance issues. One of the open questions is how `perf --call-graph
dwarf` compares to `perf -g` but with mesos compiled with frame pointers. I
haven't had time to check this yet.

When playing with the tool, it was easy to find some hot spots in the given
cluster I was looking at (which was not necessarily representative). For
the agent, jie filed:

https://issues.apache.org/jira/browse/MESOS-8901

And for the master, I noticed that metrics, state json generation (no
surprise), and a particular spot in the allocator were very expensive.

Metrics we'd like to address via migration to push gauges (Zhitao has
offered to help with this effort):

https://issues.apache.org/jira/browse/MESOS-8914

The state generation we'd like to address via streaming state into a
separate actor (and providing filtering as well), this will get further
investigated / prioritized very soon:

https://issues.apache.org/jira/browse/MESOS-8345

(2) Kapil discussed benchmarks for the long standing "offer starvation"
issue:

https://issues.apache.org/jira/browse/MESOS-3202

I'll send out an email or document soon with some background on this issue
as well as our options to address it.

Let me know if you have any questions or feedback!

Ben


Re: Add hostname or agentid in rescind offers callback

2018-05-07 Thread Benjamin Mahler
No, it doesn't sound reasonable to me given the provided reasoning.

On Sun, May 6, 2018 at 11:54 PM Varun Gupta <var...@uber.com> wrote:

> Does the feature request seem reasonable?
> On Wed, May 2, 2018 at 7:35 PM Varun Gupta <var...@uber.com> wrote:
>
> > Implementation is only needed for V1 API.
> > On Wed, May 2, 2018 at 7:31 PM Varun Gupta <var...@uber.com> wrote:
> >
> >> We aggregate all the offers for a host, such that placement engine can
> >> pack multiple tasks that can be launched on this host using aggregated
> >> resources. If offers are unused for that host then they will be
> implicitly
> >> declined.
> >>
> >> Placement cycle has two components
> >> - determine tasks that can be launched on this host and enqueue those
> tasks
> >> to be launched.
> >> - second is to deque and launch them.
> >>
> >> During this placement cycle, offers can be rescinded and accordingly
> >> placement has to be adjusted. For that we maintain second map of offer
> id
> >> -> host name.
> >>
> >> Benefits of adding host name and agent id in rescind offer callback.
> >> - Mutex locks to synchronize both maps, leads to some performance hit.
> >> - Managing second map, is more code and prone to bugs.
> >> - Little overhead on heap memory and GC.
> >> On Wed, May 2, 2018 at 5:08 PM Benjamin Mahler <bmah...@apache.org>
> >> wrote:
> >>
> >>> I'm a -1 on adding redundant information in the message.
> >>>
> >>> The scheduler can maintain an index of offers by offer id to address
> this
> >>> issue:
> >>>
> >>> hostname -> offers
> >>> offer_id -> offer
> >>>
> >>> On Wed, May 2, 2018 at 11:39 AM, Vinod Kone <vinodk...@apache.org>
> >>> wrote:
> >>>
> >>> > Can I ask why you are indexing the offers by hostname? Is it to
> better
> >>> > handle agent removal / unreachable signal?
> >>> >
> >>> > Looking at the code
> >>> > <
> >>>
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11036
> >>> >
> >>> > ,
> >>> > I think master has the requested information (agent id, hostname) so
> >>> we can
> >>> > include it in the rescind message!
> >>> >
> >>> > But there are couple things to discuss.
> >>> >
> >>> > The extra information to be included in rescind message is
> technically
> >>> > redundant. So we need to figure out a guideline on what information
> >>> should
> >>> > be included / not included (e.g., should we include agent IP too) in
> >>> such
> >>> > calls.
> >>> >
> >>> > Second, adding this extra information in v1 scheduler API would be
> >>> > relatively easy. But adding this to v0 API would be hard. Which API
> do
> >>> you
> >>> > need to be updated?
> >>> >
> >>> >
> >>> > On Wed, May 2, 2018 at 10:31 AM, Varun Gupta <var...@uber.com>
> wrote:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > Currently in our implementation we maintain two maps.
> >>> > >
> >>> > > Hostname -> []Offers
> >>> > >
> >>> > > offerID -> Hostname
> >>> > >
> >>> > > Second map is needed because rescind offers callback only provides
> >>> > offerid
> >>> > > and we need hostname to do performant lookup in first map.
> >>> > >
> >>> > > Is it feasible to add hostname or agentid in rescind offers?
> >>> > >
> >>> > > Thanks,
> >>> > > Varun
> >>> > >
> >>> >
> >>>
> >>
>


Re: 答复: libprocess libevent backend

2018-05-06 Thread Benjamin Mahler
Thanks, will look into getting this re-enabled.

Please feel free to try out the fix and let us know if you see any issues.

On Thu, May 3, 2018 at 11:35 PM, Suteng <sut...@huawei.com> wrote:

> Create an issue
> https://issues.apache.org/jira/browse/MESOS-8881
>
> -----Original Message-----
> From: Benjamin Mahler [mailto:bmah...@apache.org]
> Sent: May 4, 2018 11:47
> To: Joris Van Remoortere <joris.van.remoort...@gmail.com>;
> dev@mesos.apache.org
> Subject: Re: libprocess libevent backend
>
> +Joris
>
> Wow, Joris do you know why this is disabled? What were the issues?
>
> Suteng, can you file a JIRA ticket?
>
> On Thu, May 3, 2018 at 6:26 PM Suteng <sut...@huawei.com> wrote:
>
> >
> > In libprocess's libevent.cpp, epoll is avoided. This is the code:
> >
> > /home/suteng/code/mesos/3rdparty/libprocess/src/libevent.cpp
> >
> > 206   // TODO(jmlvanre): Allow support for 'epoll' once SSL related
> > 207   // issues are resolved.
> > 208   struct event_config* config = event_config_new();
> > 209   event_config_avoid_method(config, "epoll");
> >
> >
> >
> > -----Original Message-----
> > From: Benjamin Mahler [mailto:bmah...@apache.org]
> > Sent: May 4, 2018 3:33
> > To: dev <dev@mesos.apache.org>
> > Subject: Re: libprocess libevent backend
> >
> > Which issue are you referring to?
> >
> > Libprocess uses libev by default, with --enable-libevent as a
> > configure option to use libevent instead. Both of these backends
> > should use epoll if the system has it available. Are you seeing
> otherwise?
> >
> > On Thu, May 3, 2018 at 6:15 AM, Suteng <sut...@huawei.com> wrote:
> >
> > > libprocess uses poll as the libevent backend, can it change to epoll to
> > > improve performance?
> > >
> > > There is a TODO issue, is resolved?
> > >
> > >
> > >
> > >
> > >
> > > Thanks,
> > >
> > > SU Teng
> > >
> > >
> > >
> > >
> > >
> > > Su Teng  00241668
> > >
> > >
> > >
> > > Distributed and Parallel Software Lab
> > >
> > > Huawei Technologies Co., Ltd.
> > >
> > > Email:sut...@huawei.com
> > >
> > >
> > >
> > >
> > >
> >
>


Re: libprocess libevent backend

2018-05-03 Thread Benjamin Mahler
+Joris

Wow, Joris do you know why this is disabled? What were the issues?

Suteng, can you file a JIRA ticket?

On Thu, May 3, 2018 at 6:26 PM Suteng <sut...@huawei.com> wrote:

>
> In libprocess's libevent.cpp, epoll is avoided. This is the code:
>
> /home/suteng/code/mesos/3rdparty/libprocess/src/libevent.cpp
>
> 206   // TODO(jmlvanre): Allow support for 'epoll' once SSL related
> 207   // issues are resolved.
> 208   struct event_config* config = event_config_new();
> 209   event_config_avoid_method(config, "epoll");
>
>
>
> -----Original Message-----
> From: Benjamin Mahler [mailto:bmah...@apache.org]
> Sent: May 4, 2018 3:33
> To: dev <dev@mesos.apache.org>
> Subject: Re: libprocess libevent backend
>
> Which issue are you referring to?
>
> Libprocess uses libev by default, with --enable-libevent as a configure
> option to use libevent instead. Both of these backends should use epoll if
> the system has it available. Are you seeing otherwise?
>
> On Thu, May 3, 2018 at 6:15 AM, Suteng <sut...@huawei.com> wrote:
>
> > libprocess uses poll as the libevent backend, can it change to epoll to
> > improve performance?
> >
> > There is a TODO issue, is resolved?
> >
> >
> >
> >
> >
> > Thanks,
> >
> > SU Teng
> >
> >
> >
> >
> >
> > Su Teng  00241668
> >
> >
> >
> > Distributed and Parallel Software Lab
> >
> > Huawei Technologies Co., Ltd.
> >
> > Email:sut...@huawei.com
> >
> >
> >
> >
> >
>


Re: libprocess libevent backend

2018-05-03 Thread Benjamin Mahler
Which issue are you referring to?

Libprocess uses libev by default, with --enable-libevent as a configure
option to use libevent instead. Both of these backends should use epoll if
the system has it available. Are you seeing otherwise?

On Thu, May 3, 2018 at 6:15 AM, Suteng  wrote:

> libprocess uses poll as the libevent backend, can it change to epoll to improve
> performance?
>
> There is a TODO issue, is resolved?
>
>
>
>
>
> Thanks,
>
> SU Teng
>
>
>
>
>
> Su Teng  00241668
>
>
>
> Distributed and Parallel Software Lab
>
> Huawei Technologies Co., Ltd.
>
> Email:sut...@huawei.com
>
>
>
>
>


Re: Add hostname or agentid in rescind offers callback

2018-05-02 Thread Benjamin Mahler
I'm a -1 on adding redundant information in the message.

The scheduler can maintain an index of offers by offer id to address this
issue:

hostname -> offers
offer_id -> offer
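
In other words, something along these lines (a sketch with hypothetical scheduler-side types, not a Mesos API), so the rescind callback only needs the offer id:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

#include <mesos/v1/mesos.hpp>

using mesos::v1::Offer;
using mesos::v1::OfferID;

struct OfferIndex
{
  // offer id -> offer: single lookup when an offer is rescinded.
  std::unordered_map<std::string, Offer> byId;

  // hostname -> offer ids: used to aggregate offers per host for placement.
  std::unordered_map<std::string, std::vector<std::string>> byHost;

  void add(const Offer& offer)
  {
    byId[offer.id().value()] = offer;
    byHost[offer.hostname()].push_back(offer.id().value());
  }

  void rescind(const OfferID& offerId)
  {
    auto it = byId.find(offerId.value());
    if (it == byId.end()) {
      return; // Already accepted or declined.
    }

    std::vector<std::string>& ids = byHost[it->second.hostname()];
    ids.erase(std::remove(ids.begin(), ids.end(), offerId.value()), ids.end());
    byId.erase(it);
  }
};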

On Wed, May 2, 2018 at 11:39 AM, Vinod Kone  wrote:

> Can I ask why you are indexing the offers by hostname? Is it to better
> handle agent removal / unreachable signal?
>
> Looking at the code
> 
> ,
> I think master has the requested information (agent id, hostname) so we can
> include it in the rescind message!
>
> But there are couple things to discuss.
>
> The extra information to be included in rescind message is technically
> redundant. So we need to figure out a guideline on what information should
> be included / not included (e.g., should we include agent IP too) in such
> calls.
>
> Second, adding this extra information in v1 scheduler API would be
> relatively easy. But adding this to v0 API would be hard. Which API do you
> need to be updated?
>
>
> On Wed, May 2, 2018 at 10:31 AM, Varun Gupta  wrote:
>
> > Hi,
> >
> > Currently in our implementation we maintain two maps.
> >
> > Hostname -> []Offers
> >
> > offerID -> Hostname
> >
> > Second map is needed because rescind offers callback only provides
> offerid
> > and we need hostname to do performant lookup in first map.
> >
> > Is it feasible to add hostname or agentid in rescind offers?
> >
> > Thanks,
> > Varun
> >
>


Re: Question on status update retry in agent

2018-04-18 Thread Benjamin Mahler
 node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050
>
> W0416 00:41:48.425532 124539 status_update_manager.cpp:478] Resending
> status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef)
> for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
>
> I discussed with @zhitao. Here is a plausible explanation of the bug.
> The executor sends updates to the agent every 60 seconds, and the agent
> maintains them in a pending
> <https://github.com/apache/mesos/blob/master/src/slave/task_
> status_update_manager.hpp#L173>
> queue. Now when the acks come, they can arrive in any order for the status
> updates, but the _handle section pops
> <https://github.com/apache/mesos/blob/master/src/slave/task_
> status_update_manager.cpp#L888>
> the last update from the queue without making sure the ack was for that
> status update.
>
>
>
>
> On Tue, Apr 10, 2018 at 6:43 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > Do you have logs? Which acknowledgements did the agent receive? Which
> > TASK_RUNNING in the sequence was it re-sending?
> >
> > On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler <bmah...@apache.org>
> > wrote:
> >
> > > > Issue is that, *old executor reference is hold by slave* (assuming it
> > > did not receive acknowledgement, whereas master and scheduler have
> > > processed the status updates), so it  continues to retry TASK_RUNNING
> > > infinitely.
> > >
> > > The agent only retries so long as it does not get an acknowledgement,
> is
> > the scheduler acknowledging the duplicate updates or ignoring them?
> > >
> > > On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <var...@uber.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> We are running into an issue with slave status update manager. Below
> is
> > >> the
> > >> behavior I am seeing.
> > >>
> > >> Our use case is, we run Stateful container (Cassandra process), here
> > >> Executor polls JMX port at 60 second interval to get Cassandra State
> and
> > >> sends the state to agent -> master -> framework.
> > >>
> > >> *RUNNING Cassandra Process translates to TASK_RUNNING.*
> > >> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
> > >>
> > >> At some point slave has multiple TASK_RUNNING status updates in stream
> > and
> > >> then followed by TASK_FAILED if acknowledgements are pending. We use
> > >> explicit acknowledgements, and I see Mesos Master receives, all
> > >> TASK_RUNNING and then TASK_FAILED as well as Framework also receives
> all
> > >> TASK_RUNNING updates followed up TASK_FAILED. After receiving
> > TASK_FAILED,
> > >> Framework restarts different executor on same machine using old
> > persistent
> > >> volume.
> > >>
> > >> Issue is that, *old executor reference is hold by slave* (assuming it
> > did
> > >> not receive acknowledgement, whereas master and scheduler have
> processed
> > >> the status updates), so it  continues to retry TASK_RUNNING
> infinitely.
> > >> Here, old executor process is not running. As well as new executor
> > process
> > >> is running, and continues to work as-is. This makes me believe, some
> bug
> > >> with slave status update manager.
> > >>
> > >> I read slave status update manager code, recover
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L203>
> > >>  has a constraint
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L239>
> > >>  to ignore status updates from stream if the last executor run is
> > >> completed.
> > >>
> > >> I think, similar constraint should be applicable for status update
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L318>
> > >>  and acknowledge
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L760>
> > >> .
> > >>
> > >>
> > >> Thanks,
> > >> Varun
> > >>
> > >>
> > >>
> > >>
> > >>
> > >

Re: Performance Working Group Agenda for Tomorrow

2018-04-18 Thread Benjamin Mahler
I'll cancel this one and we can aim to meet again next month.

On Tue, Apr 17, 2018 at 12:06 PM Benjamin Mahler <bmah...@apache.org> wrote:

> Do folks have any agenda items they would like to discuss for tomorrow's
> performance working group meeting?
>
> There haven't been a lot of performance related activity in the past
> month, so will cancel this one unless folks chime in here.
>
> Ben
>


Performance Working Group Agenda for Tomorrow

2018-04-17 Thread Benjamin Mahler
Do folks have any agenda items they would like to discuss for tomorrow's
performance working group meeting?

There haven't been a lot of performance related activity in the past month,
so will cancel this one unless folks chime in here.

Ben


Re: Proposal: Asynchronous IO on Windows

2018-04-12 Thread Benjamin Mahler
Thanks for writing this up and exploring the different options Akash!

I left some comments in the doc. It seems to me the windows thread pool API
is a mix of "event" processing (timers, i/o), as well a work queue. Since
libprocess already provides a work queue via `Process`es, there's some
overlap there. I assume that using the event processing subset of the
windows thread pool API along with just 1 fixed thread is essentially
equivalent to having a "windows event loop"? We won't be using the work
queue aspect of the windows thread pool, right?

On Thu, Apr 12, 2018 at 11:58 AM, Akash Gupta (EOSG) <
aka...@microsoft.com.invalid> wrote:

> Hi all,
>
> A few weeks ago, we found serious issues with the current asynchronous IO
> implementation on Windows. The two eventing libraries in Mesos (libevent
> and libev) use `select` on Windows, which is socket-only on Windows.  In
> fact, both of these libraries typedef their socket type as SOCKET, so
> passing in an arbitrary file handle should not even compile. Essentially,
> they aren't suitable for general purpose asynchronous IO on Windows.
>
> This bug wasn't found earlier due to a number of reasons. Mesos has a
> `WindowsFD` class that multiplexes and demultiplexes the different Windows
> file types (HANDLE & SOCKET) into a singular type in order to work similar
> to UNIX platforms that use `int` for any type of file descriptor. Since
> WindowsFD is castable to a SOCKET, there were no compile errors for using
> HANDLES in libevent. Furthermore, none of the Windows HANDLEs were opened
> in asynchronous mode, so they were always blocking. This means that
> currently, any non-socket IO in Mesos blocks on Windows, so we never got
> runtime errors for sending arbitrary handles to libevent's event loop.
> Also, some of the unit tests that would catch this blocking behavior like
> in io_tests.cpp were disabled, so it was never caught in the unit tests.
>
> We wrote up a proposal on implementing asynchronous IO on Windows. The
> proposal is split into two parts that focus on stout and libprocess
> changes. The stout changes focus on opening and using asynchronous handles
> in the stout IO implementations. The libprocess changes focus on replacing
> libevent with another eventing library. We propose using the Windows
> Threadpool library, which is a native Win32 API that works like an event
> loop by allowing the user to schedule asynchronous events. Both Mesos and
> Windows uses the proactor IO pattern, so they map very cleanly. We prefer
> it over other asynchronous libraries like libuv and ASIO, since they have
> some issues mentioned in the design proposal like missing some features due
> to supporting older Windows versions. However, we understand the
> maintenance burden of adding another library, so we're looking for feedback
> on the design proposal.
>
> Link to JIRA issue: https://issues.apache.org/jira/browse/MESOS-8668
>
> Link to design doc:  https://docs.google.com/document/d/1VG_
> 8FTpWHiC7pKPoH4e-Yp7IFvAm-2wcFuk63lYByqo/edit?usp=sharing
>
> Thanks,
> Akash
>


Re: Question on status update retry in agent

2018-04-10 Thread Benjamin Mahler
Do you have logs? Which acknowledgements did the agent receive? Which
TASK_RUNNING in the sequence was it re-sending?

On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> > Issue is that, *old executor reference is hold by slave* (assuming it
> did not receive acknowledgement, whereas master and scheduler have
> processed the status updates), so it  continues to retry TASK_RUNNING
> infinitely.
>
> The agent only retries so long as it does not get an acknowledgement, is
> the scheduler acknowledging the duplicate updates or ignoring them?
>
> On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <var...@uber.com> wrote:
>
>> Hi,
>>
>> We are running into an issue with slave status update manager. Below is
>> the
>> behavior I am seeing.
>>
>> Our use case is, we run Stateful container (Cassandra process), here
>> Executor polls JMX port at 60 second interval to get Cassandra State and
>> sends the state to agent -> master -> framework.
>>
>> *RUNNING Cassandra Process translates to TASK_RUNNING.*
>> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
>>
>> At some point slave has multiple TASK_RUNNING status updates in stream and
>> then followed by TASK_FAILED if acknowledgements are pending. We use
>> explicit acknowledgements, and I see Mesos Master receives, all
>> TASK_RUNNING and then TASK_FAILED as well as Framework also receives all
>> TASK_RUNNING updates followed up TASK_FAILED. After receiving TASK_FAILED,
>> Framework restarts different executor on same machine using old persistent
>> volume.
>>
>> Issue is that, *old executor reference is hold by slave* (assuming it did
>> not receive acknowledgement, whereas master and scheduler have processed
>> the status updates), so it  continues to retry TASK_RUNNING infinitely.
>> Here, old executor process is not running. As well as new executor process
>> is running, and continues to work as-is. This makes me believe, some bug
>> with slave status update manager.
>>
>> I read slave status update manager code, recover
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L203>
>>  has a constraint
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L239>
>>  to ignore status updates from stream if the last executor run is
>> completed.
>>
>> I think, similar constraint should be applicable for status update
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L318>
>>  and acknowledge
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L760>
>> .
>>
>>
>> Thanks,
>> Varun
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>>
>> > (1) Assuming you're referring to the scheduler's acknowledgement of a
>> > status update, the agent will not forward TS2 until TS1 has been
>> > acknowledged. So, TS2 will not be acknowledged before TS1 is
>> acknowledged.
>> > FWICT, we'll ignore any violation of this ordering and log a warning.
>> >
>> > (2) To reverse the question, why would it make sense to ignore them?
>> > Assuming you're looking to reduce the number of round trips needed for
>> > schedulers to see the terminal update, I would point you to:
>> > https://issues.apache.org/jira/browse/MESOS-6941
>> >
>> > (3) When the agent sees an executor terminate, it will transition all
>> > non-terminal tasks assigned to that executor to TASK_GONE (partition
>> aware
>> > framework) or TASK_LOST (non partition aware framework) or TASK_FAILED
>> (if
>> > the container OOMed). There may be other cases, it looks a bit
>> convoluted
>> > to me.
>> >
>> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > While designing the correct behavior with one of our framework, we
>> > > encounters some questions about behavior of status update:
>> > >
>> > > The executor continuously polls the workload probe to get current
>> mode of
>> > > workload (a Cassandra server), and send various status update states
>> > > (STARTING, RUNNING, FAILED, etc).
>> > >

Re: Question on status update retry in agent

2018-04-10 Thread Benjamin Mahler
> Issue is that, *old executor reference is hold by slave* (assuming it did not
receive acknowledgement, whereas master and scheduler have processed the
status updates), so it  continues to retry TASK_RUNNING infinitely.

The agent only retries so long as it does not get an acknowledgement, is
the scheduler acknowledging the duplicate updates or ignoring them?
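
For reference, a minimal sketch of explicit acknowledgement with the v1 API (field names assumed from scheduler.proto); the agent keeps re-sending a status update until it receives the acknowledgement carrying that update's uuid, so re-sent duplicates should be acknowledged as well:

#include <mesos/v1/mesos.hpp>
#include <mesos/v1/scheduler/scheduler.hpp>

using mesos::v1::FrameworkID;
using mesos::v1::TaskStatus;
using mesos::v1::scheduler::Call;

// Build an ACKNOWLEDGE call for a status update that carries a uuid.
// Updates without a uuid (e.g. master-generated) must not be acknowledged.
Call acknowledge(const FrameworkID& frameworkId, const TaskStatus& status)
{
  Call call;
  call.mutable_framework_id()->CopyFrom(frameworkId);
  call.set_type(Call::ACKNOWLEDGE);

  Call::Acknowledge* ack = call.mutable_acknowledge();
  ack->mutable_agent_id()->CopyFrom(status.agent_id());
  ack->mutable_task_id()->CopyFrom(status.task_id());
  ack->set_uuid(status.uuid());

  return call;
}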

On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <var...@uber.com> wrote:

> Hi,
>
> We are running into an issue with slave status update manager. Below is the
> behavior I am seeing.
>
> Our use case is, we run Stateful container (Cassandra process), here
> Executor polls JMX port at 60 second interval to get Cassandra State and
> sends the state to agent -> master -> framework.
>
> *RUNNING Cassandra Process translates to TASK_RUNNING.*
> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
>
> At some point slave has multiple TASK_RUNNING status updates in stream and
> then followed by TASK_FAILED if acknowledgements are pending. We use
> explicit acknowledgements, and I see Mesos Master receives, all
> TASK_RUNNING and then TASK_FAILED as well as Framework also receives all
> TASK_RUNNING updates followed up TASK_FAILED. After receiving TASK_FAILED,
> Framework restarts different executor on same machine using old persistent
> volume.
>
> Issue is that, *old executor reference is hold by slave* (assuming it did
> not receive acknowledgement, whereas master and scheduler have processed
> the status updates), so it  continues to retry TASK_RUNNING infinitely.
> Here, old executor process is not running. As well as new executor process
> is running, and continues to work as-is. This makes me believe, some bug
> with slave status update manager.
>
> I read slave status update manager code, recover
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L203>
>  has a constraint
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L239>
>  to ignore status updates from stream if the last executor run is
> completed.
>
> I think, similar constraint should be applicable for status update
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L318>
>  and acknowledge
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L760>
> .
>
>
> Thanks,
> Varun
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > (1) Assuming you're referring to the scheduler's acknowledgement of a
> > status update, the agent will not forward TS2 until TS1 has been
> > acknowledged. So, TS2 will not be acknowledged before TS1 is
> acknowledged.
> > FWICT, we'll ignore any violation of this ordering and log a warning.
> >
> > (2) To reverse the question, why would it make sense to ignore them?
> > Assuming you're looking to reduce the number of round trips needed for
> > schedulers to see the terminal update, I would point you to:
> > https://issues.apache.org/jira/browse/MESOS-6941
> >
> > (3) When the agent sees an executor terminate, it will transition all
> > non-terminal tasks assigned to that executor to TASK_GONE (partition
> aware
> > framework) or TASK_LOST (non partition aware framework) or TASK_FAILED
> (if
> > the container OOMed). There may be other cases, it looks a bit convoluted
> > to me.
> >
> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zhitaoli...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > While designing the correct behavior with one of our framework, we
> > > encounters some questions about behavior of status update:
> > >
> > > The executor continuously polls the workload probe to get current mode
> of
> > > workload (a Cassandra server), and send various status update states
> > > (STARTING, RUNNING, FAILED, etc).
> > >
> > > Executor polls every 30 seconds, and sends status update. Here, we are
> > > seeing congestion on task update acknowledgements somewhere (still
> > > unknown).
> > >
> > > There are three scenarios that we want to understand.
> > >
> > >1. Agent queue has task update TS1, TS2 & TS3 (in this order)
> waiting
> > on
> > >acknowledgement. Suppose if TS2 receives an acknowledgement, then
> what
> > > will
> > >happen to TS1 update in the queue.
> > >
> > >
> > >1. Agent queue has task update TS1, TS2, TS3 & TA

Re: Proposal: Constrained upgrades from Mesos 1.6

2018-04-10 Thread Benjamin Mahler
-user

Do you have a link to the technical details of why this needs to be done?

For instance, why can't master/agent versions be used to determine which
behavior is performed between the master and agent?

On Tue, Apr 10, 2018 at 5:34 PM, Greg Mann  wrote:

> Hi all,
> We are currently working on patches to implement the new GROW_VOLUME and
> SHRINK_VOLUME operations [1]. In order to make it into Mesos 1.6, we're
> pursuing a workaround which affects the way these operations are accounted
> for in the Mesos master. These operations will be marked as *experimental* in
> Mesos 1.6.
>
> As a result of this workaround, upgrades from Mesos 1.6 to later versions
> would be affected. Specifically, 1.6 masters would not be able to properly
> account for the resources of failed GROW/SHRINK operations on 1.7+ agents.
> This means that when upgrading from Mesos 1.6, if GROW_VOLUME or
> SHRINK_VOLUME operations are being used during the upgrade, the masters
> *must* be upgraded first. If we follow this proposal, this constraint
> would be clearly spelled out in our upgrade documentation.
>
> Since, in general, we guarantee compatibility between Mesos masters and
> agents of the same major version, we wanted to check with the community to
> see if this constraint on 1.6 upgrades would be acceptable. Please let us
> know what you think!
>
> Cheers,
> Greg
>
>
> [1] https://issues.apache.org/jira/browse/MESOS-4965
>


CHECK_NOTNONE / CHECK_NOTERROR

2018-04-10 Thread Benjamin Mahler
Just an FYI about some recently added CHECKs that make some minor changes
to the way we write code:

(1) CHECK_NOTNONE:

Much like glog's CHECK_NOTNULL, sometimes you know from invariants that an
Option cannot be in the none state and you want to "de-reference" it
without writing logic to handle the none case:

Option<T> func(...);
T t = CHECK_NOTNONE(func(...));

Option<T> some_option = ...;
T t = CHECK_NOTNONE(std::move(some_option));

Our existing code tends to use unguarded .get() calls, so hopefully this
new pattern makes it clearer when we're assuming an option is in the some
case.

(2) CHECK_NOTERROR:

This is the same, but for Try:

Try<T> func(...);
T t = CHECK_NOTERROR(func(...));

Try<T> some_try = ...;
T t = CHECK_NOTERROR(std::move(some_try));

You can find them here:
https://github.com/apache/mesos/blob/88f5629e510d71a32bd7e0ff7ee09e150f944e72/3rdparty/stout/include/stout/check.hpp


Re: Tasks not getting killed

2018-04-10 Thread Benjamin Mahler
It's the executor's responsibility to forcefully kill a task after the task
kill grace period. However, in your case it sounds like the executor is
getting stuck? What is happening in the executor? If the executor is alive
but doesn't implement the grace period force kill logic, the solution is to
update the executor to handle grace periods and pass the grace period from
the scheduler side.

If it's the executor that is stuck, the scheduler can issue a SHUTDOWN and
after an agent configured timeout the executor will be forcefully killed:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/scheduler/scheduler.proto#L363-L373

However, this API is not possible to use reliably until MESOS-8167 is in
place.

There is also a KILL_CONTAINER agent API that allows you to manually kill a
stuck container as an operator:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/agent/agent.proto#L90
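
For completeness, a sketch of wiring the grace period through from the scheduler side when launching the task (field names assumed from mesos.proto's KillPolicy/DurationInfo); the executor is then expected to force kill once this period elapses after a KILL:

#include <mesos/v1/mesos.hpp>

using mesos::v1::DurationInfo;
using mesos::v1::TaskInfo;

// Give the executor a bounded window for graceful shutdown before it is
// expected to force kill the task.
void setKillGracePeriod(TaskInfo* task, int64_t seconds)
{
  DurationInfo* gracePeriod =
    task->mutable_kill_policy()->mutable_grace_period();

  gracePeriod->set_nanoseconds(seconds * 1000000000LL);
}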

On Tue, Apr 3, 2018 at 8:59 PM, Venkat Morampudi 
wrote:

> Hi,
>
> We have a framework that launches Spark jobs on our Mesos cluster. We are
> currently having an issue where Spark jobs are getting stuck due to some
> timeout issue. We have cancel functionality that sends a task_kill
> message to the master. When the jobs get stuck, the Spark driver task is not
> getting killed even though the agent on the node where the driver is running
> gets the kill request. Is there any timeout that I can set so that the Mesos
> agent can force kill the task in this scenario? Really appreciate your help.
>
> Thanks,
> Venkat
>
>
> Log entry from agent logs:
>
> I0404 03:44:47.367276 55066 slave.cpp:2035] Asked to kill task 79668.0.0
> of framework 35e600c2-6f43-402c-856f-9084c0040187-002
>
>

