[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607548#comment-16607548
 ] 

James Peach commented on MESOS-9178:
------------------------------------

Are we convinced that a metric is the right approach? This seems like something 
that you might want to compare over long time periods which might be more 
suitable to doing analytics on logs

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previous the argument against it is that you don't know if 
> all agents are going to come back after a master failover so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However empirically the number of agents usually don't change during the 
> failover and there's an upper bound of such wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable so we can 
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to