> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 1874-1875 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874> > > > > If we use equality operator and only set the timer when such a number > > of reregistered agent is reached we are guaranteed to only set each timer > > once (but we may need to set multiple timers in one call) right? This > > alleviates the need to check if the timer is already set! > > > > This should also work with boundary cases like the total recovered > > agents count being 0 or 1 (overlapping percentiles) etc. Right? > > Jiang Yan Xu wrote: > We should comment on the reasons for dropping issues. > > Xudong Ni wrote: > This is actaully fixed and marked as dropped somehow. We used equality > operator to 0.0 to check whether the percentile was reached or not; The > reason we used push gauge not timer is explained in push gauge vs timer > comments section
The point here is "This alleviates the need to check if the timer is already set!". It's still in the new revision. I made another comment about it. > On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/metrics.hpp > > Lines 51 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51> > > > > On using Timer (e.g., like > > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) > > vs. PushGauge, after looking at how it'll be used I think the main > > advantage of Timer is that it doesn't export any value if you haven't set > > it. > > > > Consider the two cases: > > > > 1. There are 1000 recovered agents and 0 have reregsitered, should all > > the timers have zero values or should they be absent? > > > > 2. There are 0 recovered agents (e.g., brand new cluster), should all > > of the metrics be zero or non-existent? I feel like they should be zero, as > > in, e.g., 100% of all 0 agents are reregistered within 0 secs. > > > > So timer handles this natrually. Also it sets the `_secs` name for you > > but that's a minor conveninence. > > Jiang Yan Xu wrote: > We should comment on the reasons for dropping issues. > > Xudong Ni wrote: > Sorry about this, I did make the comments but it must be in one of draft > which was not saved. > > I agree that metrics should be zero but not absent when certain > percentige were not reached. I did tried both PushGauge and Timer > implementation and tested in our clusters. > > If we used the timer, when the value was not set, that metric is actually > missing. PushGauge is set with the initial value 0.0 and we can tell whether > it's set yet, the metric will always exist no matter that percentile reached > or not, and it has better performance. > > The brand new cluster case was tested and the results were included in > the test results. My point was the opposite. When a percentage is not reached, I feel that the value being absent is preferrable and I used the above two cases to demonstrate it. (There is a subtle difference between the two IMO). It's in fact the current practice used by metrics like `state_fetch`. - Jiang Yan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209722 ----------------------------------------------------------- On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > ----------------------------------------------------------- > > (Updated Oct. 19, 2018, 4:56 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > ------- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable > agents in the failover, if certain percentile recovery couldn't be > reached, the intiail value of that percentile will not be updated. > > > Diffs > ----- > > src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc > src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 > > > Diff: https://reviews.apache.org/r/68706/diff/7/ > > > Testing > ------- > > Automation: > [ RUN ] MasterTest.MetricsInMetricsEndpoint > [ OK ] MasterTest.MetricsInMetricsEndpoint (42 ms) > > Real world cases: > > While the master is not elected or there is no agents recovered yet > "master/recovered_agents_100_percent_reregistered_secs": 0.0, > "master/recovered_agents_25_percent_reregistered_secs": 0.0, > "master/recovered_agents_50_percent_reregistered_secs": 0.0, > "master/recovered_agents_75_percent_reregistered_secs": 0.0, > "master/recovered_agents_90_percent_reregistered_secs": 0.0, > "master/recovered_agents_99_percent_reregistered_secs": 0.0, > "master/slave_reregistrations": 0.0, > > While reregistrations is in progress: 5 out of 6 completed: > "master/recovered_agents_100_percent_reregistered_secs": 0.0, > "master/recovered_agents_25_percent_reregistered_secs": 2.0, > "master/recovered_agents_50_percent_reregistered_secs": 3.0, > "master/recovered_agents_75_percent_reregistered_secs": 6.0, > "master/recovered_agents_90_percent_reregistered_secs": 0.0, > "master/recovered_agents_99_percent_reregistered_secs": 0.0, > "master/slave_reregistrations": 5.0, > > > While 6 reregistrations were all completed: > "master/recovered_agents_100_percent_reregistered_secs": 22.0, > "master/recovered_agents_25_percent_reregistered_secs": 2.0, > "master/recovered_agents_50_percent_reregistered_secs": 3.0, > "master/recovered_agents_75_percent_reregistered_secs": 6.0, > "master/recovered_agents_90_percent_reregistered_secs": 22.0, > "master/recovered_agents_99_percent_reregistered_secs": 22.0, > "master/slave_reregistrations": 6.0, > > > Thanks, > > Xudong Ni > >