Awesome. Thanks for the write-up, Ben!

On Wed, Jul 18, 2018 at 2:55 PM Benjamin Mahler <bmah...@apache.org> wrote:

> For folks that missed it, here are my own notes. Thanks to alexr and dario
> for presenting!
>
> (1) I discussed a high agent CPU usage issue when hitting the /containers
> endpoint:
>
> https://issues.apache.org/jira/browse/MESOS-8418
>
> This was resolved, but it didn't get attention for months until I noticed a
> recent complaint about it in Slack. It highlights the need to periodically
> check for new performance tickets in the backlog.
>
>
> (2) alexr presented slides on some ongoing work to improve the state
> serving performance:
>
>
> https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g
>
> This included measurements from clusters with many frameworks. The short
> term plan (hopefully in 1.7.0) is to investigate batching / parallel
> processing of state requests (still on the master actor), and halving the
> queueing time via authorizing outside of the master actor. There are
> potential longer term plans, but these short term improvements should take
> us pretty far, along with (3).
>
>
> (3) I presented some results from adapting our jsonify library to use
> rapidjson under the covers, and it cuts our state serving time in half:
>
>
> https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo
>
> The code is mainly done but there are a few things left to get it in a
> reviewable state.
>
>
> (4) I briefly mentioned various other performance work:
>
>   (a) Libprocess metrics scalability: Greg, Gilbert and I undertook some
> benchmarking and improvements were made to better handle a large number of
> metrics, in support of per-framework metrics:
>
> https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets)
>
> There's still more open work that can be done here, but a more critical
> user-facing improvement at this point is the migration to push gauges in
> the master and allocator:
>
> https://issues.apache.org/jira/browse/MESOS-8914
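
To illustrate the difference between the two gauge styles mentioned above: as I understand it, a pull gauge computes its value on demand whenever metrics are read, while a push gauge stores a value that the owner updates as its state changes, so a read never has to wait on the owner. The Python sketch below is conceptual only. The class names, method names, and the "pending offers" metric are hypothetical and are not the libprocess API.

```python
import threading


class PullGauge:
    """Hypothetical pull-style gauge: the value is computed at read time."""

    def __init__(self, compute):
        self._compute = compute  # callback owned by the component being measured

    def value(self):
        # Every metrics read runs the callback, so a busy owner delays the read.
        return float(self._compute())


class PushGauge:
    """Hypothetical push-style gauge: the owner updates a stored value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0.0

    def set(self, value):
        with self._lock:
            self._value = float(value)

    def add(self, delta):
        with self._lock:
            self._value += delta

    def value(self):
        # Reading only touches the stored number; no callback into the owner.
        with self._lock:
            return self._value


# Usage: track the number of pending offers (an illustrative metric).
pending_offers = []
pulled = PullGauge(lambda: len(pending_offers))
pushed = PushGauge()

pending_offers.append("offer-1")
pushed.add(1)
pending_offers.append("offer-2")
pushed.add(1)
```

Both gauges report the same number here. The difference is where the cost lands: the pull gauge pays on every read, while the push gauge pays a small amount on every update.
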
>
>   (b) JSON parsing cost was cut in half by avoiding conversion through an
> intermediate format and instead directly parsing into our data structures:
>
> https://issues.apache.org/jira/browse/MESOS-9067
>
>
> (5) Till, Kapil, Meng Zhu, Greg Mann, Gaston and I have been working on
> benchmarking and making performance improvements to the allocator to speed
> up allocation cycle time and to address "offer starvation". In our
> multi-framework scale testing we saw allocation cycle time go down from 15
> seconds to 5 seconds, and there's still lots of low-hanging fruit:
>
> https://issues.apache.org/jira/browse/MESOS-9087
>
> For offer starvation, we fixed an offer fragmentation issue due to quota
> "chopping" and we introduced the choice of a random weighted shuffle sorter
> as an alternative to ensure that high-share frameworks don't get starved.
> We may also investigate introducing a round-robin sorter that shuffles
> between rounds if needed:
>
> https://issues.apache.org/jira/browse/MESOS-8935
> https://issues.apache.org/jira/browse/MESOS-8936
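
For anyone unfamiliar with the idea of a weighted random shuffle, here is a minimal Python sketch of one generic way to do it (the Efraimidis-Spirakis key method). It is not the Mesos sorter code, and the client names and weights are made up for illustration. The point is that every client has a non-zero chance of landing first in a given round, while heavier clients land first more often.

```python
import random


def weighted_shuffle(weights, rng):
    """Return client names in a random order biased by weight.

    Each client draws a key u ** (1 / weight) with u uniform in [0, 1);
    sorting by key, largest first, gives a weighted random permutation.
    """
    keyed = []
    for name, weight in weights.items():
        if weight <= 0:
            raise ValueError("weights must be positive")
        # A larger weight pushes the key toward 1, i.e. toward the front.
        key = rng.random() ** (1.0 / weight)
        keyed.append((key, name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed]


# Usage: hypothetical clients, "c" has twice the weight of the others.
example_weights = {"a": 1.0, "b": 1.0, "c": 2.0}
example_order = weighted_shuffle(example_weights, random.Random())
```

Because the order is re-drawn every round, no client is stuck at the back of the line indefinitely, which is the property that matters for starvation.
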
>
>
> (6) Dario talked about the MPSC queue that was recently added to libprocess
> for use in Process event queues. This needs to be enabled at configure-time
> as is currently the case for the lock-free structures, and should provide a
> throughput improvement to libprocess. We still need to chart a path to
> turning these libprocess performance-enhancing features on by default.
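
As a rough illustration of the MPSC (multi-producer, single-consumer) contract, here is a small Python sketch. It is lock-based, not lock-free, and it is not the libprocess implementation. The class and method names are hypothetical. It only shows why having a single consumer helps: the consumer can take the whole inbox in one short critical section and then drain it without contending with producers.

```python
import threading


class MpscQueue:
    """Hypothetical lock-based MPSC queue: many producers, one consumer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inbox = []    # shared with producers, guarded by the lock
        self._private = []  # touched only by the single consumer

    def enqueue(self, item):
        # Safe to call from any number of producer threads.
        with self._lock:
            self._inbox.append(item)

    def dequeue(self):
        """Return the oldest item, or None if empty. Single consumer only."""
        if not self._private:
            with self._lock:
                # Take everything queued so far; producers get a fresh inbox.
                self._private, self._inbox = self._inbox, []
            # Reverse once so that pop() hands items back oldest first.
            self._private.reverse()
        return self._private.pop() if self._private else None


def drain(queue):
    """Consumer-side helper: dequeue until the queue reports empty."""
    items = []
    while True:
        item = queue.dequeue()
        if item is None:
            return items
        items.append(item)
```

Items enqueued by any one producer come out in the order that producer enqueued them, which is the ordering guarantee an event queue typically needs.
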
>
>
> (7) I can draft a 1.7.0 performance improvements blog post that features
> all of these topics and more. We may need to pull some of the lengthier
> content out into separate blog posts, but I think from the user
> perspective, highlighting what they get in 1.7.0 performance-wise will be
> nice.
>
> Agenda Doc:
>
> https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU
>
> Ben
>
