----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209547 -----------------------------------------------------------
I think that we need test for this as well. At minimum, we ought to update `MasterTest.MetricsInMetricsEndpoint`. Best would be a test that registers a number of agents, then restarts the master and validates the metrics. docs/monitoring.md Lines 424 (patched) <https://reviews.apache.org/r/68706/#comment294074> Although (for historical consistency) the metric name should use the `slave` terminolofy, everything else should use `agent`, including documentation and code comments. docs/monitoring.md Lines 431 (patched) <https://reviews.apache.org/r/68706/#comment294075> I think that we can make this explanation clearer. This is the elapsed time for 50% of the previously registered agents to re-register with the newly epected master after a failover. src/master/master.hpp Lines 1176 (patched) <https://reviews.apache.org/r/68706/#comment294073> Let's call this ``` updateReregistrationMetrics(); ``` src/master/master.hpp Lines 2344 (patched) <https://reviews.apache.org/r/68706/#comment294076> Since `expectedAgentCount` isn't set until `_recover()`, let's make it and `Option<int>`. Consider leaving a comment for future generations. src/master/master.hpp Lines 2422 (patched) <https://reviews.apache.org/r/68706/#comment294072> All these helper functions can be removed once you switch to `PushGauge`. src/master/master.cpp Lines 1850 (patched) <https://reviews.apache.org/r/68706/#comment294071> I found the arithmetic here pretty confusing. How about simplifying this to: ``` double percentRegistered = metrics->slave_reregistrations.value().get() / expectedAgentCount; if (slave25PercentageRegistered.value().get() == 0) { if (percentRegistered > 0.25) { slaves_25_percent_reregistered_secs = t; } } ``` src/master/master.cpp Lines 1891 (patched) <https://reviews.apache.org/r/68706/#comment294069> 2 newlines between file-scope functions. src/master/metrics.hpp Lines 51 (patched) <https://reviews.apache.org/r/68706/#comment294066> Comments should be complete sentences: ``` // Timer to track the progress of agent re-registration. ``` Note that, all comments are standardized on "re-registration". src/master/metrics.hpp Lines 58 (patched) <https://reviews.apache.org/r/68706/#comment294067> Let's make these all `PullGauge` objects. Initializing these to `0` (the default) would be indistinguishable from a metrics publishing perspective, but the performance would be better. src/master/metrics.cpp Lines 53 (patched) <https://reviews.apache.org/r/68706/#comment294064> This metric isn't documented and I'm not clear on what its semantics are. Let's remove it. - James Peach On Oct. 10, 2018, 5:22 p.m., Xudong Ni wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > ----------------------------------------------------------- > > (Updated Oct. 10, 2018, 5:22 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > ------- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable > agents in the failover, if certain percentile recovery couldn't be > reached, the intiail value of that percentile will not be updated. > > > Diffs > ----- > > docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 > src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 > src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 > src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 > src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 > > > Diff: https://reviews.apache.org/r/68706/diff/3/ > > > Testing > ------- > > Tested in mmaster with 6 reregistration agents: > "master/slave_reregistrations": 6, > > In the middle of reregistration process: > "master/slaves_100_percent_reregistered_secs": 0, > "master/slaves_25_percent_reregistered_secs": 2.244662016, > "master/slaves_50_percent_reregistered_secs": 3.599491072, > "master/slaves_75_percent_reregistered_secs": 9.53919616, > "master/slaves_90_percent_reregistered_secs": 0, > "master/slaves_99_percent_reregistered_secs": 0, > > When all registrations finished: > "master/slaves_100_percent_reregistered_secs": 29.697210112, > "master/slaves_25_percent_reregistered_secs": 2.244662016, > "master/slaves_50_percent_reregistered_secs": 3.599491072, > "master/slaves_75_percent_reregistered_secs": 9.53919616, > "master/slaves_90_percent_reregistered_secs": 29.697210112, > "master/slaves_99_percent_reregistered_secs": 29.697210112, > > With 3606 agents, the last 1% take significant time > "master/slave_reregistrations": 3606, > "master/slave_shutdowns_canceled": 0, > "master/slave_shutdowns_completed": 0, > "master/slave_shutdowns_scheduled": 0, > "master/slave_unreachable_canceled": 0, > "master/slave_unreachable_completed": 0, > "master/slave_unreachable_scheduled": 0, > "master/slaves_100_percent_reregistered_secs": 58.585202944, > "master/slaves_25_percent_reregistered_secs": 9.966434048, > "master/slaves_50_percent_reregistered_secs": 20.259571968, > "master/slaves_75_percent_reregistered_secs": 30.598885888, > "master/slaves_90_percent_reregistered_secs": 36.396082944, > "master/slaves_99_percent_reregistered_secs": 39.811022848, > > > Thanks, > > Xudong Ni > >
