[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612870#comment-16612870
 ] 

Yan Xu commented on MESOS-9178:
-------------------------------

My proposal is that we have the following metrics:

{noformat:title=}
"master/p25_agents_reregistered_secs": 1,
"master/p50_agents_reregistered_secs": 2,
"master/p75_agents_reregistered_secs": 3,
"master/p90_agents_reregistered_secs": 4,
"master/p99_agents_reregistered_secs": 5,
"master/p100_agents_reregistered_secs": 6,
{noformat}

(Suggestions for the precise naming and unit are welcome.)

Note that each metric only appears once that percentage of agents has 
reregistered, and it persists until the master fails over, at which point we 
start over with none of these metrics. The monitoring systems I have worked 
with all support filling missing values with the previous value, so if you 
plot these I expect them to show continuously how failover performance changes 
over time.
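
As a rough illustration of the gap filling described above, here is a minimal sketch in Python. The function name `forward_fill` and the sample series are hypothetical and not taken from any particular monitoring system; `None` stands for a scrape in which the metric was absent (for example, right after a master failover, before enough agents have reregistered).

```python
from typing import List, Optional, Sequence


def forward_fill(samples: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing sample (None) with the most recent present value.

    Leading gaps stay None because there is no earlier value to carry forward.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in samples:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical scrapes of "master/p90_agents_reregistered_secs": present after
# one failover, absent while a new master recovers, then present again.
series = [4.0, None, None, 3.5, None]
filled = forward_fill(series)
```

With this kind of filling, the plotted line stays at the previous failover's value until the new master publishes its own.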

I agree that we can publish to the event stream (we currently have AGENT_ADDED 
and AGENT_REMOVED), but for monitoring purposes that shifts the metric 
creation logic to an external entity.

In terms of implementation, given the tools we currently have, I think it 
works best if each metric above is its own timer (I will comment in more 
detail in the review).
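
To make the "one timer per metric" idea concrete, here is a small Python sketch of the bookkeeping only. It is not the Mesos implementation: the class `ReregistrationTracker`, the `expected_agents` count (assumed to be the number of agents the new master expects to reregister), and the use of plain floats for timestamps are all assumptions for illustration. Each metric is set exactly once, when the required fraction of agents has reregistered, and is absent before that.

```python
PERCENTILES = (25, 50, 75, 90, 99, 100)


class ReregistrationTracker:
    """Hypothetical tracker: one write-once 'timer' per percentile metric."""

    def __init__(self, expected_agents: int, elected_time: float) -> None:
        self.expected_agents = expected_agents  # assumed known at election
        self.elected_time = elected_time        # seconds, any monotonic clock
        self.reregistered = 0
        self.metrics = {}                       # metric name -> seconds

    def agent_reregistered(self, now: float) -> None:
        """Record one completed reregistration observed at time `now`."""
        self.reregistered += 1
        for p in PERCENTILES:
            name = f"master/p{p}_agents_reregistered_secs"
            # Ceiling of expected_agents * p / 100, in integer arithmetic.
            needed = (self.expected_agents * p + 99) // 100
            if name not in self.metrics and self.reregistered >= needed:
                # "Stop the timer": the metric appears now and never changes.
                self.metrics[name] = now - self.elected_time


# Example: 4 expected agents, master elected at t=100.
tracker = ReregistrationTracker(expected_agents=4, elected_time=100.0)
for t in (101.0, 102.0, 103.0, 106.0):
    tracker.agent_reregistered(t)
```

In this example the p90, p99, and p100 metrics all appear together when the fourth agent reregisters, since with only four agents each of those thresholds rounds up to four.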

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> When an agent reregisters, the time delta between that moment and the
> master's election time is saved. As reregistration progresses, each
> data entry represents one agent's reregistration time delta from the
> master's election time. The percentiles of these data, as exposed by
> these metrics, represent overall reregistration progress; if
> performance degrades towards the end of reregistration, the high
> percentiles will reflect it.
> Note: these metrics only cover completed reregistrations. They do not
> cover agents that never reregistered and were eventually marked
> unreachable; those agents are already covered by existing metrics.
> https://reviews.apache.org/r/68706/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
