Awesome. Thanks for the write-up, Ben!

On Wed, Jul 18, 2018 at 2:55 PM Benjamin Mahler <bmah...@apache.org> wrote:
> For folks that missed it, here are my own notes. Thanks to alexr and dario
> for presenting!
>
> (1) I discussed a high agent cpu usage issue when hitting the /containers
> endpoint:
>
> https://issues.apache.org/jira/browse/MESOS-8418
>
> This was resolved, but it didn't get attention for months until I noticed a
> recent complaint about it in slack. It highlights the need to periodically
> check for new performance tickets in the backlog.
>
>
> (2) alexr presented slides on some ongoing work to improve the state
> serving performance:
>
> https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g
>
> This included measurements from clusters with many frameworks. The short
> term plan (hopefully in 1.7.0) is to investigate batching / parallel
> processing of state requests (still on the master actor), and halving the
> queueing time via authorizing outside of the master actor. There are
> potential longer term plans, but these short term improvements should take
> us pretty far, along with (3).
>
>
> (3) I presented some results from adapting our jsonify library to use
> rapidjson under the covers, and it cuts our state serving time in half:
>
> https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo
>
> The code is mainly done but there are a few things left to get it in a
> reviewable state.
>
>
> (4) I briefly mentioned some various other performance work:
>
> (a) Libprocess metrics scalability: Greg, Gilbert and I undertook some
> benchmarking and improvements were made to better handle a large number of
> metrics, in support of per-framework metrics:
>
> https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets)
>
> There's still more open work that can be done here, but a more critical
> user-facing improvement at this point is the migration to push gauges in
> the master and allocator:
>
> https://issues.apache.org/jira/browse/MESOS-8914
>
> (b) JSON parsing cost was cut in half by avoiding conversion through an
> intermediate format and instead directly parsing into our data structures:
>
> https://issues.apache.org/jira/browse/MESOS-9067
>
>
> (5) Till, Kapil, Meng Zhu, Greg Mann, Gaston and I have been working on
> benchmarking and making performance improvements to the allocator to speed
> up allocation cycle time and to address "offer starvation". In our
> multi-framework scale testing we saw allocation cycle time go down from 15
> secs to 5 secs, and there's still lots of low hanging fruit:
>
> https://issues.apache.org/jira/browse/MESOS-9087
>
> For offer starvation, we fixed an offer fragmentation issue due to quota
> "chopping" and we introduced the choice of a random weighted shuffle sorter
> as an alternative to ensure that high share frameworks don't get starved.
> We may also investigate introducing a round-robin sorter that shuffles
> between rounds if needed:
>
> https://issues.apache.org/jira/browse/MESOS-8935
> https://issues.apache.org/jira/browse/MESOS-8936
>
>
> (6) Dario talked about the MPSC queue that was recently added to libprocess
> for use in Process event queues. This needs to be enabled at configure-time
> as is currently the case for the lock free structures, and should provide a
> throughput improvement to libprocess. We still need to chart a path to
> turning these libprocess performance enhancing features on by default.
>
>
> (7) I can draft a 1.7.0 performance improvements blog post that features
> all of these topics and more. We may need to pull out some of the more
> lengthy content into separate blog posts if needed, but I think from the
> user perspective, highlighting what they get in 1.7.0 performance wise will
> be nice.
>
> Agenda Doc:
>
> https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU
>
> Ben
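
For anyone unfamiliar with the "random weighted shuffle" idea from (5), here
is a minimal Python sketch of one common way to draw a weighted random
ordering (the Efraimidis-Spirakis key trick). This is an illustration only and
is not taken from the Mesos allocator: the function name, the framework names
and the weights are all made up.

```python
import random


def weighted_shuffle(weights, rng=None):
    """Return the keys of `weights` in a random order biased by weight.

    Each item draws u in (0, 1] and gets the key u ** (1 / weight); sorting
    by key in descending order yields a weighted random permutation.
    Hypothetical helper for illustration, not Mesos code.
    """
    rng = rng or random.Random()
    keyed = []
    for name, weight in weights.items():
        if weight <= 0:
            raise ValueError(f"weight for {name!r} must be positive")
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        keyed.append((u ** (1.0 / weight), name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed]


if __name__ == "__main__":
    # Made-up frameworks and weights.
    frameworks = {"analytics": 1.0, "batch": 2.0, "web": 4.0}
    print(weighted_shuffle(frameworks, random.Random(42)))
```

Since the order is re-drawn on every call, every client has a non-zero chance
of coming first in a given round, while heavier weights still come first more
often. That is the general property a randomized sorter relies on to avoid
starving any one framework.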