Neil Conway created MESOS-7681:
----------------------------------
Summary: Add safeguard for new agents with new features + old master
Key: MESOS-7681
URL: https://issues.apache.org/jira/browse/MESOS-7681
Project: Mesos
Issue Type: Improvement
Reporter: Neil Conway
Consider this scenario:
* Mesos cluster with 3 masters and 1 agent.
* Two of the masters (including the leader) are upgraded to Mesos 1.4; the
remaining master stays at Mesos 1.3 (e.g., due to operator error).
* Agent is upgraded to Mesos 1.4
* Framework creates a reservation refinement on the agent
* Leading master fails; Mesos 1.3 master is elected as the new leader
In this scenario, the agent will send resources to the master in the new
(post-refinement) format, but the master will not understand those new fields.
This results in an inconsistency between the agent's resources and the master's
view of the agent's resources. This could lead to various problems -- in
effect, the reservation the framework previously made has been "forgotten"
during master failover. Similarly, if the framework attempts to unreserve the
resources (using the master's version of the resource), that operation will be
rejected by the agent.
To fix this, it seems we need an explicit negotiation between the agent and the
master as part of registration/re-registration. The agent would examine its
resources and say which capabilities it _requires_ of the master; if the master
does not support those capabilities, the agent cannot safely register. We could
implement this either via master capabilities (agent computes the master
capabilities it requires and declines to register if the master isn't new
enough), or via agent capabilities (agent tells master the capabilities it is
"actively using"; master refuses to allow any agent to register that is using a
capability the master doesn't recognize/support). Probably the former is
safer/cleaner.
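A minimal sketch of both options, in Python for brevity rather than Mesos's own C++. Everything named here is hypothetical and is not taken from the Mesos code base: the Capability enum, the simplified Resource model (a reservation_depth greater than 1 stands in for a refined reservation), and the three functions. It only illustrates the shape of the check, not how it would be wired into registration.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    # Hypothetical capability name, for illustration only.
    RESERVATION_REFINEMENT = "RESERVATION_REFINEMENT"


@dataclass(frozen=True)
class Resource:
    # Simplified stand-in for a resource: a depth greater than 1 models
    # a refined (stacked) reservation.
    name: str
    reservation_depth: int = 0


def required_master_capabilities(resources):
    """Capabilities the master must have to understand these resources."""
    required = set()
    for resource in resources:
        if resource.reservation_depth > 1:
            required.add(Capability.RESERVATION_REFINEMENT)
    return required


def agent_may_register(resources, master_capabilities):
    """Option 1: the agent declines to register with a too-old master.

    Returns (ok, missing), where missing is the set of capabilities the
    agent requires but the master did not advertise.
    """
    missing = required_master_capabilities(resources) - set(master_capabilities)
    return (not missing, missing)


def master_accepts_agent(active_agent_capabilities, known_to_master):
    """Option 2: the master rejects an agent that is actively using a
    capability (given by name) that the master does not recognize."""
    unknown = set(active_agent_capabilities) - set(known_to_master)
    return not unknown
```

In the scenario above, the agent holds a refined reservation and the Mesos 1.3 master advertises no matching capability, so agent_may_register returns (False, {Capability.RESERVATION_REFINEMENT}) and the agent would stay unregistered instead of sending resources the master cannot interpret.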