[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612870#comment-16612870 ]
Yan Xu commented on MESOS-9178:
-------------------------------
My proposal is that we have the following metrics:
{noformat:title=}
"master/p25_agents_reregistered_secs": 1,
"master/p50_agents_reregistered_secs": 2,
"master/p75_agents_reregistered_secs": 3,
"master/p90_agents_reregistered_secs": 4,
"master/p99_agents_reregistered_secs": 5,
"master/p100_agents_reregistered_secs": 6,
{noformat}
(Suggestions for the precise naming and units are welcome.)
Note that each metric only appears once that percentage of agents has
reregistered, and it persists until the master fails over, at which point we
start over with none of these metrics. The monitoring systems I have worked
with all support filling missing values with the previous value, so if you plot
these I expect them to show how failover performance changes over time.
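To illustrate the fill behavior mentioned above, here is a minimal sketch of forward-filling a scraped series. The function name `forward_fill` and the sample values are hypothetical and not tied to any particular monitoring system; real systems expose this as a query or graph option.

```python
def forward_fill(samples):
    """Replace missing (None) samples with the most recent earlier value."""
    filled = []
    last = None
    for value in samples:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical scrapes of "master/p90_agents_reregistered_secs".
# The two None entries stand for scrapes taken after a failover, before
# 90% of the agents have reregistered with the new master.
scrapes = [4.0, 4.0, None, None, 7.5]
print(forward_fill(scrapes))
```

With a fill like this, the plotted line holds the previous failover's value until the new one is published, so each step in the graph corresponds to one failover.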
I agree that we can publish to the event stream (we currently have AGENT_ADDED
and AGENT_REMOVED), but for monitoring purposes that shifts the metric
creation logic to an external entity.
In terms of implementation, given the tools we currently have, I think it works
best if each metric above is its own timer (I will comment in more detail in
the review).
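As a rough sketch of the proposed semantics (not the actual Mesos metrics code), the following models each percentile as a value that is set once, when that share of agents has reregistered. The class name, the `expected_agents` parameter, and the rounding rule are assumptions made for illustration; in particular it assumes the master knows how many agents it expects back.

```python
class ReregistrationPercentiles:
    """Records, per percentile, the seconds from master election until
    that share of the expected agents has reregistered."""

    def __init__(self, expected_agents, percentiles=(25, 50, 75, 90, 99, 100)):
        self.reregistered = 0
        # Agents needed before each metric appears (integer ceiling;
        # the rounding rule is an assumption of this sketch).
        self.thresholds = {
            p: max(1, (expected_agents * p + 99) // 100) for p in percentiles
        }
        # Empty right after a failover: no metric is published yet.
        self.metrics = {}

    def agent_reregistered(self, secs_since_elected):
        """Call once per completed reregistration."""
        self.reregistered += 1
        for p, needed in self.thresholds.items():
            name = f"master/p{p}_agents_reregistered_secs"
            # Each metric is written once and then persists unchanged.
            if self.reregistered >= needed and name not in self.metrics:
                self.metrics[name] = secs_since_elected


tracker = ReregistrationPercentiles(expected_agents=4)
for secs in (1.0, 2.0, 3.0):
    tracker.agent_reregistered(secs)
print(sorted(tracker.metrics))
```

After three of four agents have reregistered, only the p25, p50 and p75 entries exist; the p90, p99 and p100 entries appear when the fourth agent reregisters. A new master would start from a fresh, empty tracker.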
> Add a metric for master failover time.
> --------------------------------------
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Minor
>
> When an agent reregisters, the time delta between that moment and
> the time the master was elected is saved. As reregistration progresses,
> each data entry represents the reregistration time delta from the
> master elected time. The percentiles of these data, as exposed by these
> metrics, represent the overall reregistration progress. If there is
> degradation towards the end of reregistration, the high percentiles
> will reflect it.
> Note: These metrics only represent completed reregistrations. They
> do not cover agents that never reregistered and were eventually marked
> unreachable; unreachable agents are already monitored by existing
> metrics.
> https://reviews.apache.org/r/68706/