> On Oct. 15, 2018, 4:45 p.m., James Peach wrote:
> > I think that we need test for this as well. At minimum, we ought to update
> > `MasterTest.MetricsInMetricsEndpoint`. Best would be a test that registers
> > a number of agents, then restarts the master and validates the metrics.

A basic test was added, and we validated the actual metrics in a real environment.


> On Oct. 15, 2018, 4:45 p.m., James Peach wrote:
> > src/master/master.cpp
> > Lines 1850 (patched)
> > <https://reviews.apache.org/r/68706/diff/3/?file=2095910#file2095910line1850>
> >
> > I found the arithmetic here pretty confusing. How about simplifying
> > this to:
> > ```
> >
> > double percentRegistered = metrics->slave_reregistrations.value().get()
> >   / expectedAgentCount;
> >
> > if (slave25PercentageRegistered.value().get() == 0) {
> >   if (percentRegistered > 0.25) {
> >     slaves_25_percent_reregistered_secs = t;
> >   }
> > }
> > ```

This was already changed in response to another comment and now uses ceil:

    if ((recovered_agents_25_percent_reregistered_secs.value().get() == 0.0) &&
        (reregisteredAgentCount == ceil(recoveredAgentCount.get() * 0.25))) {
      recovered_agents_25_percent_reregistered_secs = t;
    }


- Xudong


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209547
-----------------------------------------------------------


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
>
> (Updated Oct. 19, 2018, 11:56 p.m.)
>
>
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
>
>
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
>
>
> Repository: mesos
>
>
> Description
> -------
>
> During a master failover, the time at which the master is elected is
> considered the start of the failover.
> As reregistration progresses, each percentile metric records the time
> at which that percentage of agents finished reregistering. Together,
> these metrics represent the overall reregistration progress. If
> reregistration degrades towards the end, the high percentiles reflect
> it. If there are unreachable agents during the failover and a given
> percentile cannot be reached, the initial value of that percentile
> is not updated.
>
>
> Diffs
> -----
>
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898
>
>
> Diff: https://reviews.apache.org/r/68706/diff/7/
>
>
> Testing
> -------
>
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
>
> Real world cases:
>
> While the master is not elected or no agents have recovered yet:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
>
> While reregistration is in progress (5 out of 6 completed):
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
>
>
> After all 6 reregistrations completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
>
>
> Thanks,
>
> Xudong Ni
>

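For readers following the ceil-based check quoted in the reply, here is a minimal standalone sketch of the same idea, generalized to all six percentiles. It is not the code from the patch: the class `ReregistrationPercentiles` and its members `threshold`, `agentReregistered`, and `value` are hypothetical names chosen for illustration, and plain doubles stand in for the Mesos metric types. As in the thread, 0.0 is treated as "not reached yet".

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <map>

// Hypothetical helper, not part of Mesos: records the elapsed time at which
// each percentile of recovered agents has finished reregistering.
class ReregistrationPercentiles
{
public:
  explicit ReregistrationPercentiles(std::size_t recoveredAgentCount)
    : recovered(recoveredAgentCount)
  {
    assert(recoveredAgentCount > 0);
    for (int percent : {25, 50, 75, 90, 99, 100}) {
      secs[percent] = 0.0; // 0.0 means "threshold not reached yet".
    }
  }

  // Number of reregistered agents needed to reach `percent`,
  // rounded up as in the ceil-based check from the review.
  std::size_t threshold(int percent) const
  {
    return static_cast<std::size_t>(
        std::ceil(recovered * (percent / 100.0)));
  }

  // Call once per agent reregistration; `elapsedSecs` is the time
  // since the master was elected.
  void agentReregistered(double elapsedSecs)
  {
    ++reregistered;
    for (auto& entry : secs) {
      // Only set a percentile once, at the exact agent count that reaches it.
      if (entry.second == 0.0 && reregistered == threshold(entry.first)) {
        entry.second = elapsedSecs;
      }
    }
  }

  double value(int percent) const { return secs.at(percent); }

private:
  std::size_t recovered;
  std::size_t reregistered = 0;
  std::map<int, double> secs;
};
```

With six recovered agents, the thresholds work out to 2, 3, 5, 6, 6, and 6 agents for the 25, 50, 75, 90, 99, and 100 percent metrics. That is consistent with the reported output, where five of six reregistrations set only the 25, 50, and 75 percent values and the sixth sets the remaining three at once.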