> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/metrics.cpp
> > Lines 259 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972384#file1972384line259>
> >
> > I don't know that I like the idea of a metric that is absent and then
> > present. I'd prefer that we just published a `0.0` until recovery is
> > complete.
> >
> > Suggest we keep the recovery timestamp in the `Slave` and just publish
> > that.

I thought about that too, but I actually like the idea of the metric being absent while the value is not yet available. A zero value could confuse downstream aggregation. For example, our team wants to compute the average recovery time across our cluster of thousands of agents, and zero values from agents that are still recovering would skew that calculation.

I think Mesos already has some precedent for metrics that are absent and then present. For instance, metrics under `allocator/mesos/roles/<role>/...` can show up when a framework under a new role registers after the master has started.

Let me know what you think.


> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/slave.cpp
> > Lines 7322 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972385#file1972385line7322>
> >
> > Since the gauge is being published in seconds, you need to use
> > `Duration::secs` to convert.

I prefer the API call to work on `Duration` and perform the `secs()` conversion as late as possible, as I've seen many cases where people pass the wrong time unit when an API takes an integer or float.


- Zhitao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65954/#review198952
-----------------------------------------------------------


On March 7, 2018, 11:20 p.m., Zhitao Li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65954/
> -----------------------------------------------------------
>
> (Updated March 7, 2018, 11:20 p.m.)
>
>
> Review request for mesos, Gilbert Song, Greg Mann, Jason Lai, and James Peach.
>
>
> Bugs: MESOS-8609
>     https://issues.apache.org/jira/browse/MESOS-8609
>
>
> Repository: mesos
>
>
> Description
> -------
>
> The new metric `slave/recover_secs` can be used to tell us how long
> Mesos agent needed to finish its recovery cycle.
> This is an important metric on agent machines which have a lot of
> completed executor sandboxes.
>
> Note that the metric 1) will only be available after recovery succeeded
> and 2) never change its value across agent process lifecycle afterwards.
>
>
> Diffs
> -----
>
>   src/slave/metrics.hpp 3fc933ca65690d6fad63156398ad9c2c53789296
>   src/slave/metrics.cpp 0eb2b59ed67e14e73b29d7592c239441df0008d5
>   src/slave/slave.cpp e2facb3c15a2f907f6497c58a36842ed707f2c70
>
>
> Diff: https://reviews.apache.org/r/65954/diff/2/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Zhitao Li
>
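
To make the aggregation concern from the reply concrete, here is a minimal sketch of averaging recovery times across agents. It is not Mesos code: `RecoverSecs`, `averagePresent` and `averageWithZeroPlaceholder` are hypothetical names, and `std::optional` stands in for an agent that has not yet published `slave/recover_secs`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sample type: one entry per agent. An empty optional models an
// agent whose `slave/recover_secs` metric is still absent.
using RecoverSecs = std::optional<double>;

// Averages only the agents that actually report a value. Returns an empty
// optional when no agent has reported yet.
std::optional<double> averagePresent(const std::vector<RecoverSecs>& samples)
{
  double sum = 0.0;
  std::size_t count = 0;

  for (const RecoverSecs& sample : samples) {
    if (sample.has_value()) {
      sum += *sample;
      ++count;
    }
  }

  if (count == 0) {
    return std::nullopt;
  }

  return sum / static_cast<double>(count);
}

// Averages as if every agent that is still recovering published `0.0`.
// The placeholder zeros are indistinguishable from real samples.
double averageWithZeroPlaceholder(const std::vector<RecoverSecs>& samples)
{
  assert(!samples.empty());

  double sum = 0.0;
  for (const RecoverSecs& sample : samples) {
    sum += sample.value_or(0.0);
  }

  return sum / static_cast<double>(samples.size());
}
```

With two agents that recovered in 30 and 50 seconds and two that are still recovering, the first function averages over the two reported values only, while the second also counts the two placeholder zeros and so reports a lower figure.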
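
The preference for passing a duration type and converting to seconds as late as possible can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `RecoveryMetric` is a hypothetical class, and `std::chrono` durations stand in for the `Duration` type mentioned in the review.

```cpp
#include <chrono>

// Hypothetical holder for the recovery time. It is not part of Mesos.
class RecoveryMetric
{
public:
  // Accepts any std::chrono duration, so a caller cannot hand over a bare
  // number in the wrong unit; the unit travels with the type.
  template <typename Rep, typename Period>
  void set(const std::chrono::duration<Rep, Period>& elapsed)
  {
    elapsed_ = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
  }

  // Converts to fractional seconds only at the point where the gauge value
  // is read.
  double secs() const
  {
    return std::chrono::duration<double>(elapsed_).count();
  }

private:
  std::chrono::nanoseconds elapsed_{0};
};
```

A caller would write `metric.set(std::chrono::milliseconds(1500))` and the gauge would read the value through `metric.secs()`, so the unit conversion happens in exactly one place.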