-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209547
-----------------------------------------------------------



I think that we need test for this as well. At minimum, we ought to update 
`MasterTest.MetricsInMetricsEndpoint`. Best would be a test that registers a 
number of agents, then restarts the master and validates the metrics.


docs/monitoring.md
Lines 424 (patched)
<https://reviews.apache.org/r/68706/#comment294074>

    Although (for historical consistency) the metric name should use the 
`slave` terminolofy, everything else should use `agent`, including 
documentation and code comments.



docs/monitoring.md
Lines 431 (patched)
<https://reviews.apache.org/r/68706/#comment294075>

    I think that we can make this explanation clearer. This is the elapsed time 
for 50% of the previously registered agents to re-register with the newly 
epected master after a failover.



src/master/master.hpp
Lines 1176 (patched)
<https://reviews.apache.org/r/68706/#comment294073>

    Let's call this
    ```
    updateReregistrationMetrics();
    ```



src/master/master.hpp
Lines 2344 (patched)
<https://reviews.apache.org/r/68706/#comment294076>

    Since `expectedAgentCount` isn't set until `_recover()`, let's make it and 
`Option<int>`. Consider leaving a comment for future generations.



src/master/master.hpp
Lines 2422 (patched)
<https://reviews.apache.org/r/68706/#comment294072>

    All these helper functions can be removed once you switch to `PushGauge`.



src/master/master.cpp
Lines 1850 (patched)
<https://reviews.apache.org/r/68706/#comment294071>

    I found the arithmetic here pretty confusing. How about simplifying this to:
    ```
    
    double percentRegistered = metrics->slave_reregistrations.value().get() / 
expectedAgentCount;
    
    if (slave25PercentageRegistered.value().get() == 0) {
      if (percentRegistered > 0.25) {
        slaves_25_percent_reregistered_secs = t;
      }
    }
    ```



src/master/master.cpp
Lines 1891 (patched)
<https://reviews.apache.org/r/68706/#comment294069>

    2 newlines between file-scope functions.



src/master/metrics.hpp
Lines 51 (patched)
<https://reviews.apache.org/r/68706/#comment294066>

    Comments should be complete sentences:
    ```
    // Timer to track the progress of agent re-registration.
    ```
    
    Note that, all comments are standardized on "re-registration".



src/master/metrics.hpp
Lines 58 (patched)
<https://reviews.apache.org/r/68706/#comment294067>

    Let's make these all `PullGauge` objects. Initializing these to `0` (the 
default) would be indistinguishable from a metrics publishing perspective, but 
the performance would be better.



src/master/metrics.cpp
Lines 53 (patched)
<https://reviews.apache.org/r/68706/#comment294064>

    This metric isn't documented and I'm not clear on what its semantics are. 
Let's remove it.


- James Peach


On Oct. 10, 2018, 5:22 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 10, 2018, 5:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/3/
> 
> 
> Testing
> -------
> 
> Tested in mmaster with 6 reregistration agents:
> "master/slave_reregistrations": 6,
> 
> In the middle of reregistration process:
> "master/slaves_100_percent_reregistered_secs": 0,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 0,
> "master/slaves_99_percent_reregistered_secs": 0,
> 
> When all registrations finished:
> "master/slaves_100_percent_reregistered_secs": 29.697210112,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 29.697210112,
> "master/slaves_99_percent_reregistered_secs": 29.697210112,
> 
> With 3606 agents, the last 1% take significant time
> "master/slave_reregistrations": 3606,
> "master/slave_shutdowns_canceled": 0,
> "master/slave_shutdowns_completed": 0,
> "master/slave_shutdowns_scheduled": 0,
> "master/slave_unreachable_canceled": 0,
> "master/slave_unreachable_completed": 0,
> "master/slave_unreachable_scheduled": 0,
> "master/slaves_100_percent_reregistered_secs": 58.585202944,
> "master/slaves_25_percent_reregistered_secs": 9.966434048,
> "master/slaves_50_percent_reregistered_secs": 20.259571968,
> "master/slaves_75_percent_reregistered_secs": 30.598885888,
> "master/slaves_90_percent_reregistered_secs": 36.396082944,
> "master/slaves_99_percent_reregistered_secs": 39.811022848,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Reply via email to