[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609911#comment-16609911
 ] 

James Peach commented on MESOS-9178:
------------------------------------

Say you have a time-series gauge at various percentages, as per [~bmahler]'s 
suggestion. The gauge value would have to persist: once it is set, it would 
remain at that value thereafter. If you needed to do analytics, you would have 
to carefully pick out the first sample after each failover. For time-series 
data, the easiest thing to do is to plot it, and it's not at all clear to me 
how you could do that and show a meaningful graph, because what you really 
want is to compare the historical failover times. I'm not that experienced 
with Grafana, but I don't know how I would do that.
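
To make the "first sample after a failover" problem concrete, here is a 
minimal sketch of the post-processing an analytics job might need. It assumes 
(hypothetically, nothing in Mesos defines this) that the gauge reads 0 until 
the new master has finished failing over and then holds its value, and that 
the scraped series is a list of (timestamp, value) pairs. The function name 
and the data layout are illustrative only.

```python
from typing import List, Tuple

# One scraped sample: (scrape timestamp, gauge value in seconds).
# Assumption: the gauge reads 0 until the failover time is known,
# then stays at that value until the next master failover resets it.
Sample = Tuple[float, float]


def first_samples_after_failover(samples: List[Sample]) -> List[Sample]:
    """Keep only the first non-zero sample that follows each reset to 0."""
    picked = []
    previous = 0.0  # treat the start of the series as "not yet set"
    for timestamp, value in samples:
        if value > 0 and previous == 0:
            picked.append((timestamp, value))
        previous = value
    return picked


if __name__ == "__main__":
    series = [(0, 0.0), (10, 42.0), (20, 42.0), (30, 0.0), (40, 17.5), (50, 17.5)]
    print(first_samples_after_failover(series))
```

Every other sample in the series only repeats a value that was already 
reported, which is why plotting the raw gauge does not directly give a 
comparison of historical failover times.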

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't 
> know if all agents are going to come back after a master failover, so there's 
> no certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable), so we 
> can just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.
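
The definition quoted above ("the time it takes for all agents recovered from 
the registry to be accounted for") can be sketched as a small timer. This is 
only an illustration under stated assumptions, not the Mesos master's code: 
the class name `FailoverTimer`, its methods, and the use of plain string agent 
IDs and float timestamps are all hypothetical.

```python
from typing import Iterable, Optional


class FailoverTimer:
    """Measures the time until every recovered agent is accounted for.

    Hypothetical sketch: an agent counts as "accounted for" once it has
    either reregistered or been marked unreachable (for example after the
    10 minute timeout mentioned in the issue description).
    """

    def __init__(self, recovered_agents: Iterable[str], start_time: float):
        self.pending = set(recovered_agents)
        self.start_time = start_time
        # With no recovered agents there is nothing to wait for.
        self.duration: Optional[float] = None if self.pending else 0.0

    def account_for(self, agent_id: str, now: float) -> Optional[float]:
        """Record that an agent reregistered or was marked unreachable.

        Returns the failover duration once it is known, else None.
        """
        self.pending.discard(agent_id)
        if not self.pending and self.duration is None:
            self.duration = now - self.start_time
        return self.duration


if __name__ == "__main__":
    timer = FailoverTimer({"agent-1", "agent-2"}, start_time=100.0)
    timer.account_for("agent-1", now=130.0)         # reregistered
    print(timer.account_for("agent-2", now=700.0))  # marked unreachable
```

Because agents that never come back are eventually marked unreachable, the 
pending set always drains, so the timer is guaranteed to stop.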



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
