For folks who missed it, here are my notes. Thanks to alexr and dario for presenting!
(1) I discussed a high agent CPU usage issue when hitting the /containers endpoint: https://issues.apache.org/jira/browse/MESOS-8418 This was resolved, but it went unnoticed for months until I saw a recent complaint about it in Slack. It highlights the need to periodically check the backlog for new performance tickets.

(2) alexr presented slides on ongoing work to improve state serving performance: https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g This included measurements from clusters with many frameworks. The short-term plan (hopefully in 1.7.0) is to investigate batching / parallel processing of state requests (still on the master actor) and to halve the queueing time by authorizing outside of the master actor. There are potential longer-term plans, but these short-term improvements, along with (3), should take us pretty far.

(3) I presented results from adapting our jsonify library to use rapidjson under the covers, which cuts our state serving time in half: https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo The code is mostly done, but a few things remain to get it into a reviewable state.
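The batching idea in (2) can be sketched in a few lines of Python (the real implementation is C++ on the master actor; the `Master` class and its method names here are invented for illustration). The point is that requests arriving while the actor is busy are answered from a single serialization pass instead of one pass per request:

```python
import json

class Master:
    """Toy master that batches concurrent /state requests.

    Requests that arrive while the actor is busy are queued, and one
    serialization of the state then answers the whole batch.
    """

    def __init__(self, state):
        self.state = state
        self.pending = []   # request ids waiting for a response

    def request_state(self, request_id):
        self.pending.append(request_id)

    def serve_batch(self):
        # Serialize the state once for the entire batch, rather than
        # once per request.
        body = json.dumps(self.state)
        responses = {request_id: body for request_id in self.pending}
        self.pending.clear()
        return responses

master = Master({"frameworks": 2, "agents": 100})
for i in range(3):              # three requests arrive while busy
    master.request_state(i)
responses = master.serve_batch()  # one json.dumps call serves all three
```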
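To illustrate why the jsonify approach in (3) is fast: it writes JSON text directly from the objects into one growing buffer, rather than first building an intermediate value tree and then serializing it. A minimal Python sketch, where `StreamingWriter` is a hypothetical stand-in and not the actual jsonify/rapidjson API:

```python
import json

class StreamingWriter:
    """Write JSON text directly, with no intermediate dict/list tree."""

    def __init__(self):
        self.parts = []

    def write_object(self, fields):
        # fields: list of (key, value) pairs with scalar values.
        self.parts.append("{")
        for i, (key, value) in enumerate(fields):
            if i:
                self.parts.append(", ")
            self.parts.append(json.dumps(key))
            self.parts.append(": ")
            self.parts.append(json.dumps(value))
        self.parts.append("}")

    def result(self):
        return "".join(self.parts)

writer = StreamingWriter()
writer.write_object([("id", "agent-1"), ("cpus", 4)])
direct = writer.result()

# The slower path: build an intermediate dict, then serialize it.
intermediate = json.dumps({"id": "agent-1", "cpus": 4})
```

Both paths produce identical output; the streaming path just skips the allocation of the intermediate tree, which is where rapidjson's writer saves time.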
(4) I briefly mentioned various other performance work:

(a) Libprocess metrics scalability: Greg, Gilbert, and I did some benchmarking, and improvements were made to better handle a large number of metrics, in support of per-framework metrics: https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets) There's still more open work here, but the more critical user-facing improvement at this point is the migration to push gauges in the master and allocator: https://issues.apache.org/jira/browse/MESOS-8914

(b) JSON parsing cost was cut in half by avoiding conversion through an intermediate format and instead parsing directly into our data structures: https://issues.apache.org/jira/browse/MESOS-9067

(5) Till, Kapil, Meng Zhu, Greg Mann, Gaston, and I have been benchmarking and improving the allocator to speed up allocation cycle time and to address "offer starvation". In our multi-framework scale testing, allocation cycle time went down from 15 seconds to 5 seconds, and there's still plenty of low-hanging fruit: https://issues.apache.org/jira/browse/MESOS-9087 For offer starvation, we fixed an offer fragmentation issue caused by quota "chopping", and we introduced the choice of a random weighted shuffle sorter as an alternative to ensure that high-share frameworks don't get starved. If needed, we may also investigate a round-robin sorter that shuffles between rounds: https://issues.apache.org/jira/browse/MESOS-8935 https://issues.apache.org/jira/browse/MESOS-8936

(6) Dario talked about the MPSC queue that was recently added to libprocess for use in Process event queues. It needs to be enabled at configure time, as is currently the case for the lock-free structures, and should provide a throughput improvement to libprocess. We still need to chart a path to turning these libprocess performance-enhancing features on by default.
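The push-gauge migration in (a) comes down to where the cost is paid. A pull gauge computes its value at scrape time, which in the master means a dispatch into the busy master actor for every gauge on every scrape; a push gauge is updated when the underlying state changes, so a scrape is a plain read. A toy Python sketch (the real code is C++ in libprocess; these class names are made up):

```python
class PullGauge:
    """Value is computed on every scrape (e.g. a dispatch to an actor)."""

    def __init__(self, compute):
        self.compute = compute  # callable, potentially expensive

    def value(self):
        return self.compute()   # cost is paid here, at scrape time

class PushGauge:
    """Value is stored when it changes; scraping is a cheap read."""

    def __init__(self):
        self._value = 0

    def set(self, value):
        self._value = value     # cost is paid here, at update time

    def value(self):
        return self._value

# Hypothetical master state.
tasks = ["t1", "t2", "t3"]

pull = PullGauge(lambda: len(tasks))   # scraping triggers the computation
push = PushGauge()
push.set(len(tasks))                   # updated when the state changes
```

The trade-off: push gauges can read slightly stale values between updates, but scraping hundreds of them no longer floods the master actor with dispatches.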
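The parsing change in (b) can be sketched in Python using `json.loads`'s `object_hook`, which constructs the target objects during the parse itself instead of producing a generic tree that is converted in a second pass. The `Task` type here is invented for illustration; the actual Mesos change targets its own data structures:

```python
import json
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: float

def task_hook(obj):
    # Construct the target type during parsing -- no second pass
    # over an intermediate dict tree.
    return Task(name=obj["name"], cpus=obj["cpus"])

doc = '{"name": "web", "cpus": 0.5}'

# Two-step path: JSON -> dict -> Task (via an intermediate format).
d = json.loads(doc)
two_step = Task(name=d["name"], cpus=d["cpus"])

# One-step path: JSON -> Task directly.
one_step = json.loads(doc, object_hook=task_hook)
```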
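One standard way to implement the weighted shuffle mentioned in (5) (a sketch of the general technique, not necessarily the exact Mesos sorter) is the Efraimidis–Spirakis key trick: sort clients by `random() ** (1 / weight)` descending, so a higher-weight client is proportionally more likely to appear earlier, yet low-weight clients still get turns at the front and are never starved outright:

```python
import random

def weighted_shuffle(clients, weights, rng=random):
    """Return clients in a random order biased by weight.

    Each client gets the key rng.random() ** (1 / weight); sorting by
    that key in descending order yields a weighted random permutation
    (Efraimidis & Spirakis).
    """
    keys = {c: rng.random() ** (1.0 / w) for c, w in zip(clients, weights)}
    return sorted(clients, key=keys.__getitem__, reverse=True)

rng = random.Random(42)  # seeded for reproducibility
clients = ["big-framework", "small-framework"]

# With weights 4:1, the big framework should come first about 80% of
# the time -- often, but not always, so neither side is starved.
firsts = sum(
    weighted_shuffle(clients, [4, 1], rng)[0] == "big-framework"
    for _ in range(10000)
)
```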
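The MPSC (multi-producer, single-consumer) contract in (6), in a toy Python sketch: many threads may enqueue events concurrently, but exactly one consumer (the Process) drains the queue, so event handling stays serialized. The real libprocess queue is a lock-free C++ structure; `queue.Queue` here just demonstrates the shape of the contract:

```python
import queue
import threading

events = queue.Queue()          # stands in for a Process event queue

def producer(name, n):
    # Many producer threads may enqueue concurrently...
    for i in range(n):
        events.put((name, i))

threads = [threading.Thread(target=producer, args=(f"p{k}", 100))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# ...but exactly one consumer drains the queue, so events are handled
# one at a time, as with a libprocess Process.
drained = []
while not events.empty():
    drained.append(events.get())
```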
(7) I can draft a 1.7.0 performance improvements blog post featuring all of these topics and more. We may need to pull some of the lengthier content out into separate blog posts, but from the user perspective, highlighting what they get performance-wise in 1.7.0 will be nice.

Agenda Doc: https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU

Ben