[
https://issues.apache.org/jira/browse/MESOS-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224037#comment-16224037
]
Jie Yu commented on MESOS-8058:
-------------------------------
commit 75cad1213e218a7f114c46c3d7c92047dae80345 (HEAD -> master, origin/master,
origin/HEAD)
Author: Benjamin Bannier <[email protected]>
Date: Fri Oct 13 21:05:27 2017 -0700
Disallowed combining resource providers and CheckpointResourcesMessage.
Offer operations on resource provider resources can require
asynchronous handling since they can in principle take a long time to
complete. Additionally, they can fail even after passing validation in
the master, e.g., due to outside changes to the affected resources.
For these reasons, resource provider resources require an offer
operation protocol allowing failures outside of the master and
communicating these failures to the master.
Since this feedback can only be provided asynchronously, resource
provider resources are incompatible with `CheckpointResourcesMessage`
which by design updates the agent with the master's view of the
agent's resources, and does not account for asynchronous changes to
the agent's resources (leading e.g., to incompatible state between
master and agents).
This patch makes sure that agents with resource providers do not use
the `CheckpointResourcesMessage` protocol. This prevents users from
running resource provider agents against legacy masters.
Review: https://reviews.apache.org/r/62974/
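
The restriction described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from the patch: the function name `select_update_protocol`, its two boolean parameters, and the returned label `"offer-operation-feedback"` are all hypothetical and do not correspond to actual Mesos identifiers.

```python
def select_update_protocol(agent_has_resource_providers: bool,
                           master_is_legacy: bool) -> str:
    """Pick how resource changes are communicated (hypothetical sketch).

    Assumption: a "legacy" master is one that can only communicate
    resource changes via CheckpointResourcesMessage.
    """
    if agent_has_resource_providers:
        if master_is_legacy:
            # Resource provider operations may fail asynchronously, and
            # CheckpointResourcesMessage has no way to report such failures
            # back to the master, so this combination is rejected.
            raise RuntimeError(
                "resource provider agents cannot run against legacy masters")
        # Hypothetical label for a protocol with asynchronous feedback.
        return "offer-operation-feedback"
    # Agents without resource providers may keep using the older protocol.
    return "CheckpointResourcesMessage"
```

Rejecting the combination outright, rather than silently falling back, matches the stated intent of keeping users from running resource provider agents against legacy masters.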
> Agent and master can race when updating agent state
> ---------------------------------------------------
>
> Key: MESOS-8058
> URL: https://issues.apache.org/jira/browse/MESOS-8058
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.5.0
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Critical
> Labels: mesosphere
> Fix For: 1.5.0
>
>
> In {{2af9a5b07dc80151154264e974d03f56a1c25838}} we introduced the use of
> {{UpdateSlaveMessage}} for the agent to inform the master about its current
> total resources. Currently we trigger this message only on agent registration
> and reregistration.
> This can race with operations applied in the master and communicated via
> {{CheckpointResourcesMessage}}.
> Example:
> 1. Agent ({{cpus:4(\*)}}) registers.
> 2. Master is triggered to apply an operation to the agent's resources, e.g.,
> a reservation: {{cpus:4(\*) -> cpus:4(A)}}. The master applies the operation
> to its current view of the agent's resources and sends the agent a
> {{CheckpointResourcesMessage}} so the agent can persist the result.
> 3. The agent sends the master an {{UpdateSlaveMessage}}, e.g., {{cpus:4(\*)}}
> since it hasn't received the {{CheckpointResourcesMessage}} yet.
> 4. The master processes the {{UpdateSlaveMessage}} and updates its view of
> the agent's resources to be {{cpus:4(\*)}}.
> 5. The agent processes the {{CheckpointResourcesMessage}} and updates its
> view of its resources to be {{cpus:4(A)}}.
> 6. The agent and the master have an inconsistent view of the agent's
> resources.
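
The interleaving in the example above can be reproduced with a small simulation. This is a minimal sketch under simplifying assumptions: resources are modeled as plain strings, each direction of communication is a simple queue, and the names `Master`, `Agent`, and `simulate_race` are hypothetical stand-ins rather than Mesos classes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Master:
    view: str = ""  # the master's view of the agent's resources


@dataclass
class Agent:
    resources: str = "cpus:4(*)"  # the agent's view of its own resources


def simulate_race():
    master, agent = Master(), Agent()
    to_agent, to_master = deque(), deque()  # in-flight messages

    # 1. The agent registers; the master learns its resources.
    master.view = agent.resources

    # 2. The master applies a reservation to its own view and sends a
    #    CheckpointResourcesMessage, which is still in flight.
    master.view = "cpus:4(A)"
    to_agent.append(("CheckpointResourcesMessage", master.view))

    # 3. The agent has not processed that message yet and reports its
    #    (stale) total resources in an UpdateSlaveMessage.
    to_master.append(("UpdateSlaveMessage", agent.resources))

    # 4. The master processes the UpdateSlaveMessage and overwrites its view.
    _, total = to_master.popleft()
    master.view = total

    # 5. The agent processes the CheckpointResourcesMessage.
    _, checkpointed = to_agent.popleft()
    agent.resources = checkpointed

    # 6. The two views now disagree.
    return master.view, agent.resources


master_view, agent_view = simulate_race()
print(master_view, agent_view)
```

In this model the master ends up with the unreserved resources while the agent holds the reserved ones, because each side overwrites its state with a message that was sent before the other side's change became visible.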