-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review208599
-----------------------------------------------------------

src/master/master.cpp
Lines 7436 (patched)
<https://reviews.apache.org/r/68706/#comment292676>

There are a few considerations when using one timer for all agent reregistrations.

1. The full list of statistics only makes sense once 100% of the agents have reregistered. See this example from the testing section:

```
"master/slaves_reregistration_secs":32.321583104,
"master/slaves_reregistration_secs/count":7,
"master/slaves_reregistration_secs/max":32.321583104,
"master/slaves_reregistration_secs/min":3.35373696,
"master/slaves_reregistration_secs/p50":8.774915072,
"master/slaves_reregistration_secs/p90":30.8676036608,
"master/slaves_reregistration_secs/p95":31.594593382399996,
"master/slaves_reregistration_secs/p99":32.176185159679996,
"master/slaves_reregistration_secs/p999":32.307043309567995,
"master/slaves_reregistration_secs/p9999":32.3201291245568,
```

What does /p50 mean when only 200 of the 1000 total agents have reregistered? I think metrics always being meaningful is a requirement for continuous monitoring.

2. `TimeSeries` drops old values during [sparsification](https://github.com/apache/mesos/blob/91ca8e2b2071f7e4b89702ae7c807b074bdef31b/3rdparty/libprocess/include/process/timeseries.hpp#L191). Therefore, if the total agent count in the cluster is bigger than the time series capacity (large Mesos deployments have O(10000) agents), old values get dropped and the p25 is no longer "25% of the total agents", even after all agents have reregistered. This leads to the next point.

3. `Timer`'s semantics: it's supposed to work when each observation has equal semantics, e.g., each [state_store](http://mesos.apache.org/documentation/latest/monitoring/#registrar) has the same meaning, so it's OK to drop old values, which may reduce precision but will not change the meaning of the statistics.

Let's start with how the monitoring system will use it and work backwards.
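
To make points 1 and 2 concrete, here is a minimal, self-contained sketch. It is not Mesos code: every name in it (`percentile`, `reregistrationTimes`, `firstN`, `lastN`) is hypothetical, the nearest-rank percentile is an assumption that may differ from what `Timer` actually computes, and `lastN` is a deliberately simplified stand-in for a bounded history (the real `TimeSeries` sparsification policy is more involved).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Mesos API): nearest-rank percentile over the
// samples observed so far. `p` must be in (0, 1].
double percentile(std::vector<double> samples, double p)
{
  assert(!samples.empty() && p > 0.0 && p <= 1.0);
  std::sort(samples.begin(), samples.end());
  // Smallest sample such that at least p * n samples are <= it.
  const std::size_t rank =
    static_cast<std::size_t>(std::ceil(p * samples.size()));
  return samples[rank - 1];
}

// Hypothetical workload: agent i reregisters i seconds after the master
// is elected, so the samples are 1, 2, ..., `agents`.
std::vector<double> reregistrationTimes(std::size_t agents)
{
  std::vector<double> times;
  for (std::size_t i = 1; i <= agents; ++i) {
    times.push_back(static_cast<double>(i));
  }
  return times;
}

// The first `n` observations: what a single timer has seen part-way
// through reregistration (point 1).
std::vector<double> firstN(const std::vector<double>& v, std::size_t n)
{
  n = std::min(n, v.size());
  return std::vector<double>(v.begin(), v.begin() + n);
}

// The last `n` observations: a simplified model of a history whose
// capacity is smaller than the agent count (point 2). This is an
// assumption for illustration, not the real sparsification algorithm.
std::vector<double> lastN(const std::vector<double>& v, std::size_t n)
{
  n = std::min(n, v.size());
  return std::vector<double>(v.end() - n, v.end());
}
```

With `t = reregistrationTimes(1000)`, `percentile(firstN(t, 200), 0.5)` is 100 while `percentile(t, 0.5)` is 500, so the reported p50 keeps shifting until every agent is back. Likewise `percentile(lastN(t, 100), 0.25)` is 925 while `percentile(t, 0.25)` is 250, so once values are dropped the p25 no longer describes 25% of the total agents.
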
I think my proposal in https://issues.apache.org/jira/browse/MESOS-9178 is worth considering, but maybe there are better ones.


- Jiang Yan Xu


On Sept. 12, 2018, 2:42 p.m., Xudong Ni wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
>
> (Updated Sept. 12, 2018, 2:42 p.m.)
>
>
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
>
>
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When an agent reregisters, the time delta between that moment and
> the master elected time is saved. While reregistration is in progress,
> each data entry represents the registration time delta from the master
> elected time. The percentiles of these data in this metric can
> represent overall reregistration progress. In case of degradation
> towards the end of reregistration, the high percentiles will
> reflect it.
>
> Note: These metrics only represent completed reregistrations. They
> do not monitor agents that were eventually marked as unreachable because
> reregistration never happened; unreachable agents are
> already monitored by existing metrics.
>
>
> Diffs
> -----
>
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757
>
>
> Diff: https://reviews.apache.org/r/68706/diff/1/
>
>
> Testing
> -------
>
> Tested in mmaster with seven reregistering agents:
> "master/slaves_reregistration_secs":32.321583104,
> "master/slaves_reregistration_secs/count":7,
> "master/slaves_reregistration_secs/max":32.321583104,
> "master/slaves_reregistration_secs/min":3.35373696,
> "master/slaves_reregistration_secs/p50":8.774915072,
> "master/slaves_reregistration_secs/p90":30.8676036608,
> "master/slaves_reregistration_secs/p95":31.594593382399996,
> "master/slaves_reregistration_secs/p99":32.176185159679996,
> "master/slaves_reregistration_secs/p999":32.307043309567995,
> "master/slaves_reregistration_secs/p9999":32.3201291245568,
>
>
> Thanks,
>
> Xudong Ni
>
>
