[
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589159#comment-16589159
]
James Peach commented on MESOS-9178:
------------------------------------
/cc [~bmahler]
> Add a metric for master failover time.
> --------------------------------------
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't know
> if all agents are going to come back after a master failover, so there's no
> certain point that marks the end of "full reregistration of all agents".
> However, empirically the number of agents usually doesn't change during the
> failover, and there's an upper bound on such a wait (after a 10min timeout the
> agents that haven't reregistered are going to be marked unreachable), so we can
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered
> from the registry to be accounted for" i.e., either reregistered or marked as
> unreachable.
> This is of course looking at failover from an agent reregistration
> perspective.
> Later after we add framework info persistence, we can similarly define the
> framework perspective using reregistration time or reconciliation time.