For folks who missed it, here are my notes. Thanks to alexr and dario for presenting!
(1) I discussed a high agent CPU usage issue when hitting the /containers endpoint: https://issues.apache.org/jira/browse/MESOS-8418 This was resolved, but it went unnoticed for months until I saw a recent complaint about it in Slack. It highlights the need to periodically check the backlog for new performance tickets.

(2) alexr presented slides on ongoing work to improve state serving performance: https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g This included measurements from clusters with many frameworks. The short-term plan (hopefully in 1.7.0) is to investigate batching / parallel processing of state requests (still on the master actor) and to halve the queueing time by authorizing outside of the master actor. There are potential longer-term plans, but these short-term improvements, along with (3), should take us pretty far.

(3) I presented results from adapting our jsonify library to use rapidjson under the covers, which cuts our state serving time in half: https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo The code is mostly done, but a few things remain to get it into a reviewable state.
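The batching idea in (2) can be sketched in a few lines of Python (the real implementation is C++ on the master actor; the `Master` class and its method names here are invented for illustration). The point is that requests arriving while the actor is busy are answered from a single serialization pass instead of one pass per request:

```python
import json

class Master:
    """Toy master that batches concurrent /state requests.

    Requests that arrive while the actor is busy are queued, and one
    serialization of the state then answers the whole batch.
    """

    def __init__(self, state):
        self.state = state
        self.pending = []   # request ids waiting for a response

    def request_state(self, request_id):
        self.pending.append(request_id)

    def serve_batch(self):
        # Serialize the state once for the entire batch, rather than
        # once per request.
        body = json.dumps(self.state)
        responses = {request_id: body for request_id in self.pending}
        self.pending.clear()
        return responses

master = Master({"frameworks": 2, "agents": 100})
for i in range(3):              # three requests arrive while busy
    master.request_state(i)
responses = master.serve_batch()  # one json.dumps call serves all three
```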
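To illustrate why the jsonify approach in (3) is fast: it writes JSON text directly from the objects into one growing buffer, rather than first building an intermediate value tree and then serializing it. A minimal Python sketch, where `StreamingWriter` is a hypothetical stand-in and not the actual jsonify/rapidjson API:

```python
import json

class StreamingWriter:
    """Write JSON text directly, with no intermediate dict/list tree."""

    def __init__(self):
        self.parts = []

    def write_object(self, fields):
        # fields: list of (key, value) pairs with scalar values.
        self.parts.append("{")
        for i, (key, value) in enumerate(fields):
            if i:
                self.parts.append(", ")
            self.parts.append(json.dumps(key))
            self.parts.append(": ")
            self.parts.append(json.dumps(value))
        self.parts.append("}")

    def result(self):
        return "".join(self.parts)

writer = StreamingWriter()
writer.write_object([("id", "agent-1"), ("cpus", 4)])
direct = writer.result()

# The slower path: build an intermediate dict, then serialize it.
intermediate = json.dumps({"id": "agent-1", "cpus": 4})
```

Both paths produce identical output; the streaming path just skips the allocation of the intermediate tree, which is where rapidjson's writer saves time.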
(4) I briefly mentioned various other performance work:

(a) Libprocess metrics scalability: Greg, Gilbert, and I did some benchmarking, and improvements were made to better handle a large number of metrics, in support of per-framework metrics: https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets) There's still more open work here, but the more critical user-facing improvement at this point is the migration to push gauges in the master and allocator: https://issues.apache.org/jira/browse/MESOS-8914

(b) JSON parsing cost was cut in half by avoiding conversion through an intermediate format and instead parsing directly into our data structures: https://issues.apache.org/jira/browse/MESOS-9067

(5) Till, Kapil, Meng Zhu, Greg Mann, Gaston, and I have been benchmarking and improving the allocator to speed up allocation cycle time and to address "offer starvation". In our multi-framework scale testing, allocation cycle time went down from 15 seconds to 5 seconds, and there's still plenty of low-hanging fruit: https://issues.apache.org/jira/browse/MESOS-9087 For offer starvation, we fixed an offer fragmentation issue caused by quota "chopping", and we introduced the choice of a random weighted shuffle sorter as an alternative to ensure that high-share frameworks don't get starved. If needed, we may also investigate a round-robin sorter that shuffles between rounds: https://issues.apache.org/jira/browse/MESOS-8935 https://issues.apache.org/jira/browse/MESOS-8936

(6) Dario talked about the MPSC queue that was recently added to libprocess for use in Process event queues. It needs to be enabled at configure time, as is currently the case for the lock-free structures, and should provide a throughput improvement to libprocess. We still need to chart a path to turning these libprocess performance-enhancing features on by default.
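The push-gauge migration in (a) comes down to where the cost is paid. A pull gauge computes its value at scrape time, which in the master means a dispatch into the busy master actor for every gauge on every scrape; a push gauge is updated when the underlying state changes, so a scrape is a plain read. A toy Python sketch (the real code is C++ in libprocess; these class names are made up):

```python
class PullGauge:
    """Value is computed on every scrape (e.g. a dispatch to an actor)."""

    def __init__(self, compute):
        self.compute = compute  # callable, potentially expensive

    def value(self):
        return self.compute()   # cost is paid here, at scrape time

class PushGauge:
    """Value is stored when it changes; scraping is a cheap read."""

    def __init__(self):
        self._value = 0

    def set(self, value):
        self._value = value     # cost is paid here, at update time

    def value(self):
        return self._value

# Hypothetical master state.
tasks = ["t1", "t2", "t3"]

pull = PullGauge(lambda: len(tasks))   # scraping triggers the computation
push = PushGauge()
push.set(len(tasks))                   # updated when the state changes
```

The trade-off: push gauges can read slightly stale values between updates, but scraping hundreds of them no longer floods the master actor with dispatches.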
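The parsing change in (b) can be sketched in Python using `json.loads`'s `object_hook`, which constructs the target objects during the parse itself instead of producing a generic tree that is converted in a second pass. The `Task` type here is invented for illustration; the actual Mesos change targets its own data structures:

```python
import json
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: float

def task_hook(obj):
    # Construct the target type during parsing -- no second pass
    # over an intermediate dict tree.
    return Task(name=obj["name"], cpus=obj["cpus"])

doc = '{"name": "web", "cpus": 0.5}'

# Two-step path: JSON -> dict -> Task (via an intermediate format).
d = json.loads(doc)
two_step = Task(name=d["name"], cpus=d["cpus"])

# One-step path: JSON -> Task directly.
one_step = json.loads(doc, object_hook=task_hook)
```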
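One standard way to implement the weighted shuffle mentioned in (5) (a sketch of the general technique, not necessarily the exact Mesos sorter) is the Efraimidis–Spirakis key trick: sort clients by `random() ** (1 / weight)` descending, so a higher-weight client is proportionally more likely to appear earlier, yet low-weight clients still get turns at the front and are never starved outright:

```python
import random

def weighted_shuffle(clients, weights, rng=random):
    """Return clients in a random order biased by weight.

    Each client gets the key rng.random() ** (1 / weight); sorting by
    that key in descending order yields a weighted random permutation
    (Efraimidis & Spirakis).
    """
    keys = {c: rng.random() ** (1.0 / w) for c, w in zip(clients, weights)}
    return sorted(clients, key=keys.__getitem__, reverse=True)

rng = random.Random(42)  # seeded for reproducibility
clients = ["big-framework", "small-framework"]

# With weights 4:1, the big framework should come first about 80% of
# the time -- often, but not always, so neither side is starved.
firsts = sum(
    weighted_shuffle(clients, [4, 1], rng)[0] == "big-framework"
    for _ in range(10000)
)
```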
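The MPSC (multi-producer, single-consumer) contract in (6), in a toy Python sketch: many threads may enqueue events concurrently, but exactly one consumer (the Process) drains the queue, so event handling stays serialized. The real libprocess queue is a lock-free C++ structure; `queue.Queue` here just demonstrates the shape of the contract:

```python
import queue
import threading

events = queue.Queue()          # stands in for a Process event queue

def producer(name, n):
    # Many producer threads may enqueue concurrently...
    for i in range(n):
        events.put((name, i))

threads = [threading.Thread(target=producer, args=(f"p{k}", 100))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# ...but exactly one consumer drains the queue, so events are handled
# one at a time, as with a libprocess Process.
drained = []
while not events.empty():
    drained.append(events.get())
```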
(7) I can draft a 1.7.0 performance improvements blog post featuring all of these topics and more. We may need to pull some of the lengthier content out into separate blog posts, but from the user perspective, highlighting what they get performance-wise in 1.7.0 will be nice.

Agenda Doc: https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU

Ben