[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

James Peach (JIRA) Tue, 11 Sep 2018 08:29:07 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610795#comment-16610795
 ]


James Peach commented on MESOS-9178:
------------------------------------

Another way to measure this is to publish it in the event stream.

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's no 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable), so we can 
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.
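
The definition quoted above (failover ends once every agent recovered from the
registry is either reregistered or marked unreachable) could be sketched
roughly as follows. This is only an illustration, not Mesos code: the
FailoverTimer class, its method names, and the agent IDs are hypothetical, and
the 600-second constant simply restates the 10min timeout from the description.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Assumption: the 10-minute reregistration timeout mentioned in the issue.
AGENT_REREGISTER_TIMEOUT_SECS = 600.0


@dataclass
class FailoverTimer:
    """Hypothetical helper (not part of Mesos): agent-side failover time."""

    started_at: float          # when the new master started recovery
    pending: Set[str]          # agents recovered from the registry, not yet accounted for
    finished_at: Optional[float] = None

    def __post_init__(self) -> None:
        # With no recovered agents there is nothing to wait for.
        if not self.pending:
            self.finished_at = self.started_at

    def account_for(self, agent_id: str, now: float) -> None:
        """Record that an agent reregistered or was marked unreachable."""
        self.pending.discard(agent_id)
        if not self.pending and self.finished_at is None:
            self.finished_at = now

    def expire(self, now: float) -> None:
        """Once the timeout has passed, treat remaining agents as unreachable."""
        if now - self.started_at >= AGENT_REREGISTER_TIMEOUT_SECS:
            for agent_id in list(self.pending):
                self.account_for(agent_id, now)

    @property
    def failover_secs(self) -> Optional[float]:
        """Elapsed failover time, or None while agents are still pending."""
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at


timer = FailoverTimer(started_at=0.0, pending={"a1", "a2", "a3"})
timer.account_for("a1", now=12.0)   # reregistered
timer.account_for("a2", now=45.5)   # reregistered
timer.expire(now=600.0)             # "a3" never came back: marked unreachable
```

In this sketch the timer stops at whichever comes first: the last pending agent
being accounted for, or the timeout sweep. That keeps the metric bounded even
when some agents never return.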



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
