Benjamin Mahler created MESOS-7688:
--------------------------------------
Summary: Improve master failover performance by reducing
unnecessary agent retries.
Key: MESOS-7688
URL: https://issues.apache.org/jira/browse/MESOS-7688
Project: Mesos
Issue Type: Improvement
Components: agent, master
Reporter: Benjamin Mahler
Currently, during a master failover, the agents (re-)register with the master.
While the master is recovering, it may drop messages from the agents, so the
agents must retry registration using a backoff mechanism. For large clusters,
processing these unnecessary retries can add a lot of overhead: every retried
message must be deserialized again, and each one contains all of the agent's
task / executor information.
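
To give a rough sense of scale, here is a minimal sketch that counts how many
registration attempts a single agent sends (and the master drops) while the
master is still recovering. It assumes a plain doubling backoff with a cap and
no jitter; the function name and the default values are illustrative
assumptions, not the agent's actual backoff parameters.

```python
def count_dropped_attempts(recovery_secs,
                           initial_backoff_secs=1.0,
                           max_backoff_secs=60.0):
    """Count attempts one agent sends before the master has recovered.

    ASSUMPTION: doubling backoff with a cap and no jitter; these are
    illustrative values, not the real agent settings.
    """
    elapsed = 0.0
    backoff = initial_backoff_secs
    dropped = 0
    while elapsed < recovery_secs:
        dropped += 1  # the attempt sent at time `elapsed` is dropped
        elapsed += backoff
        backoff = min(backoff * 2, max_backoff_secs)
    return dropped


if __name__ == "__main__":
    per_agent = count_dropped_attempts(recovery_secs=120)
    # Hypothetical cluster size, used only to show how the cost multiplies.
    print(per_agent, per_agent * 10_000)
```

Under these assumptions, each dropped attempt is a full (re-)registration
message that the master deserialized for nothing, and the total grows with the
number of agents.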
In order to reduce this overhead, the idea is to avoid the need for agents to
blindly retry (re-)registration with the master. Two approaches for this are:
(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of
an abuse of MasterInfo unfortunately, but the idea is for agents to only
(re-)register once they see that the master has reached a recovered state. Once
recovered, the master will not drop messages, and therefore agents only need to
retry when the connection breaks.
(2) Have the master reply with a retry message when it's in the recovering
state, so that agents get a clear signal that their messages were dropped. This
one is less efficient, because while recovering the master may still have to
process a lot of messages and send retries. Once the master is recovered,
however, it will process only a single (re-)registration from each agent. Here,
agents only retry when the connection breaks or they get a retry message.
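
The two approaches can be sketched with a toy model. Everything here is a
hypothetical illustration rather than existing Mesos code: the `recovered`
field on `MasterInfo`, the `RETRY` reply, and all class and function names are
assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class MasterInfo:
    """Simplified stand-in for the MasterInfo published in ZK."""
    master_id: str
    recovered: bool = False  # ASSUMPTION: hypothetical field for approach (1)


class Reply(Enum):
    REGISTERED = "registered"
    RETRY = "retry"  # ASSUMPTION: hypothetical reply for approach (2)


class Master:
    """Toy master that counts retries sent and registrations processed."""

    def __init__(self, master_id):
        self.master_id = master_id
        self.recovered = False
        self.registered = set()
        self.fully_processed = 0
        self.retries_sent = 0

    def info(self):
        # Snapshot of what would be published for agents to watch.
        return MasterInfo(self.master_id, self.recovered)

    def handle_register(self, agent_id):
        if not self.recovered:
            self.retries_sent += 1  # explicit signal instead of a silent drop
            return Reply.RETRY
        self.fully_processed += 1
        self.registered.add(agent_id)
        return Reply.REGISTERED


def register_on_recovered(master, agent_id, info_updates):
    """Approach (1): send only after a recovered MasterInfo is observed.

    Returns the number of registration messages sent.
    """
    sent = 0
    for info in info_updates:
        if not info.recovered:
            continue  # master is still recovering, so stay quiet
        sent += 1
        if master.handle_register(agent_id) is Reply.REGISTERED:
            break
    return sent


def register_with_retry_reply(master, agent_id, wait, max_attempts=10):
    """Approach (2): resend only when the master answers with RETRY.

    `wait` is called with the attempt number between attempts (for example
    to sleep for a backoff). Returns the number of messages sent.
    """
    for attempt in range(1, max_attempts + 1):
        if master.handle_register(agent_id) is Reply.REGISTERED:
            return attempt
        wait(attempt)
    raise RuntimeError("master did not recover within max_attempts")
```

In this model, approach (1) sends nothing until the recovered state is
published, while approach (2) still costs the master one reply per early
attempt. In both cases the master fully processes one (re-)registration per
agent. Retrying after a broken connection is left out to keep the sketch short.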