Xudong Ni created MESOS-9178:
--------------------------------

             Summary: mesos metrics: master failover time
                 Key: MESOS-9178
                 URL: https://issues.apache.org/jira/browse/MESOS-9178
             Project: Mesos
          Issue Type: Improvement
          Components: master
            Reporter: Xudong Ni


Quote from Yan Xu: Previously, the argument against it was that you don't know 
whether all agents are going to come back after a master failover, so there's 
no certain point that marks the end of "full reregistration of all agents". 
However, empirically the number of agents usually doesn't change during the 
failover, and there's an upper bound on such a wait: after a 10min timeout, the 
agents that haven't reregistered are going to be marked unreachable, so we can 
just use that to stop the timer.

So we can define failover time as "the time it takes for all agents recovered 
from the registry to be accounted for", i.e., either reregistered or marked as 
unreachable.
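
A minimal sketch of that definition, in Python rather than the master's C++ 
and not taken from the Mesos code base. The class name, method names, and 
agent IDs are all hypothetical, and timestamps are plain seconds supplied by 
the caller. The timer starts with the set of agents recovered from the 
registry and stops once every one of them has either reregistered or been 
marked unreachable:

```python
class FailoverTimer:
    """Hypothetical illustration only; not an actual Mesos class."""

    def __init__(self, recovered_agents, start_time):
        # Agents recovered from the registry that are not yet accounted for.
        self.pending = set(recovered_agents)
        self.start_time = start_time
        # With no recovered agents there is nothing to wait for.
        self.end_time = None if self.pending else start_time

    def _account_for(self, agent_id, now):
        # Agents that were not recovered from the registry are ignored.
        self.pending.discard(agent_id)
        if not self.pending and self.end_time is None:
            self.end_time = now

    def on_reregistered(self, agent_id, now):
        self._account_for(agent_id, now)

    def on_marked_unreachable(self, agent_id, now):
        self._account_for(agent_id, now)

    def failover_seconds(self):
        # None while at least one recovered agent is still unaccounted for.
        if self.end_time is None:
            return None
        return self.end_time - self.start_time


# Two agents reregister quickly; the third is marked unreachable when the
# 10min timeout fires, which is what stops the timer.
timer = FailoverTimer(["agent-1", "agent-2", "agent-3"], start_time=0.0)
timer.on_reregistered("agent-1", now=5.0)
timer.on_reregistered("agent-2", now=12.0)
still_running = timer.failover_seconds()
timer.on_marked_unreachable("agent-3", now=600.0)
final = timer.failover_seconds()
```

Because every recovered agent ends up in one of the two states, the timer 
always stops, and in the worst case it stops when the unreachable timeout 
fires.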

This is of course looking at failover from an agent reregistration perspective.

Later, after we add framework info persistence, we can similarly define 
failover time from the framework perspective, using reregistration time or 
reconciliation time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
