[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610795#comment-16610795 ]
James Peach commented on MESOS-9178:
------------------------------------

Another way to measure this is to publish it in the event stream.

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't know
> if all agents are going to come back after a master failover, so there's no
> definite point that marks the end of "full reregistration of all agents".
> However, empirically the number of agents usually doesn't change during the
> failover, and there's an upper bound on such a wait (after a 10min timeout the
> agents that haven't reregistered are going to be marked unreachable), so we can
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered
> from the registry to be accounted for", i.e., either reregistered or marked as
> unreachable.
> This is of course looking at failover from an agent reregistration
> perspective.
> Later, after we add framework info persistence, we can similarly define the
> framework perspective using reregistration time or reconciliation time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
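
The failover-time definition quoted in the issue description can be illustrated with a minimal sketch. This is not Mesos code: the `FailoverTimer` class, its method names, and the use of plain float timestamps are all hypothetical, chosen only to show how the timer could stop once every agent recovered from the registry is either reregistered or marked unreachable.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FailoverTimer:
    """Hypothetical tracker for the proposed metric (not part of Mesos)."""

    recovered_agents: Set[str]  # agent IDs recovered from the registry
    start_time: float           # when the new master took over, in seconds
    pending: Set[str] = field(init=False)
    failover_seconds: Optional[float] = field(default=None, init=False)

    def __post_init__(self) -> None:
        # Every recovered agent starts out unaccounted for.
        self.pending = set(self.recovered_agents)
        if not self.pending:
            # Nothing to wait for, so the failover is complete immediately.
            self.failover_seconds = 0.0

    def _account(self, agent_id: str, now: float) -> None:
        # Agents that were not recovered from the registry are ignored.
        self.pending.discard(agent_id)
        if not self.pending and self.failover_seconds is None:
            # The last recovered agent is accounted for: stop the timer once.
            self.failover_seconds = now - self.start_time

    def agent_reregistered(self, agent_id: str, now: float) -> None:
        self._account(agent_id, now)

    def agent_marked_unreachable(self, agent_id: str, now: float) -> None:
        # Assumed to be called when the reregistration timeout expires.
        self._account(agent_id, now)


timer = FailoverTimer({"a1", "a2", "a3"}, start_time=0.0)
timer.agent_reregistered("a1", now=5.0)
timer.agent_reregistered("a2", now=12.0)
timer.agent_marked_unreachable("a3", now=600.0)
print(timer.failover_seconds)
```

Because agents that never return are eventually marked unreachable, the measured value in this sketch is bounded by the reregistration timeout, which matches the upper bound described in the issue.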