-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------




src/master/master.hpp
Lines 2346-2347 (patched)
<https://reviews.apache.org/r/68706/#comment294277>

    I wasn't sure if we needed a dedicated struct / method for these metrics 
when we first discussed this approach but looking at the the additional method, 
math and class member state that we need to track I feel that we do. 
    
    It's probably the best if we put such logic in the master `Metrics` class. 
It'll isolate the metric calculation logic from the master and you can use the 
struct's member state to help with tracking.
    
    Similar examples already exist here. e.g., 
https://github.com/apache/mesos/blob/f03f91545fec64a767f808c48a3b352dc86b1ac0/src/master/metrics.hpp#L89
    
    I propose that we add something like this
    
    ```
    struct AgentReregistrations
    {
      AgentReregistrations(int recoveredAgentCount);
      ~AgentReregistrations();
    
      // If you use `Timer` you don't even need this. See comment about Timer 
below.
      process::Time electedTime, 
      int recoveredAgentCount;
      int reregisteredAgentCount = 0;
      // ... Whatever else state you may need.
      
      process::metrics::PushGauge recovered_agents_25_percent_reregistered_secs;
      // ... other gauges.
      
      
      void incrementAgentReregistered();
    };
    
    // Option since you only set it after the master is elected.
    Option<AgentReregistrations> agentReregistrations;
    ```



src/master/master.cpp
Lines 1866-1868 (patched)
<https://reviews.apache.org/r/68706/#comment294282>

    But we naturally don't need float comparison, we are comparing the 
reregistered agents count (int) against the **minimum** number of agents that 
**satisfy** the percentage requirement (int). So if you take a `ceil` of the 
floating point multiplication result and cast to int you'll get the 
semantically correct number?
    
    e.g., if total = 5 and we are calculating 75% the `reregisteredAgentCount` 
we are looking for is 4.



src/master/master.cpp
Lines 1874-1875 (patched)
<https://reviews.apache.org/r/68706/#comment294312>

    If we use equality operator and only set the timer when such a number of 
reregistered agent is reached we are guaranteed to only set each timer once 
(but we may need to set multiple timers in one call) right? This alleviates the 
need to check if the timer is already set!
    
    This should also work with boundary cases like the total recovered agents 
count being 0 or 1 (overlapping percentiles) etc. Right?



src/master/metrics.hpp
Lines 50 (patched)
<https://reviews.apache.org/r/68706/#comment294275>

    Can we use the names agreed-upon with BenM in the slack channel which is 
e.g., `recovered_agents_50_percent_reregistered_secs`?



src/master/metrics.hpp
Lines 51 (patched)
<https://reviews.apache.org/r/68706/#comment294313>

    On using Timer (e.g., like 
[state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
 vs. PushGauge, after looking at how it'll be used I think the main advantage 
of Timer is that it doesn't export any value if you haven't set it.
    
    Consider the two cases:
    
    1. There are 1000 recovered agents and 0 have reregsitered, should all the 
timers have zero values or should they be absent?
    
    2. There are 0 recovered agents (e.g., brand new cluster), should all of 
the metrics be zero or non-existent? I feel like they should be zero, as in, 
e.g., 100% of all 0 agents are reregistered within 0 secs.
    
    So timer handles this natrually. Also it sets the `_secs` name for you but 
that's a minor conveninence.


- Jiang Yan Xu


On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Reply via email to