[ 
https://issues.apache.org/jira/browse/MESOS-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224037#comment-16224037
 ] 

Jie Yu commented on MESOS-8058:
-------------------------------

commit 75cad1213e218a7f114c46c3d7c92047dae80345 (HEAD -> master, origin/master, origin/HEAD)
Author: Benjamin Bannier <[email protected]>
Date:   Fri Oct 13 21:05:27 2017 -0700

    Disallowed combining resource providers and CheckpointResourcesMessage.

    Offer operations on resource provider resources can require
    asynchronous handling since they can in principle take a long time to
    complete. Additionally, they can fail even after passing validation in
    the master, e.g., due to outside changes to the affected resources.
    For these reasons, resource provider resources require an offer
    operation protocol allowing failures outside of the master and
    communicating these failures to the master.

    Since this feedback can only be provided asynchronously, resource
    provider resources are incompatible with `CheckpointResourcesMessage`
    which by design updates the agent with the master's view of the
    agent's resources, and does not account for asynchronous changes to
    the agent's resources (leading, e.g., to inconsistent state between
    master and agents).

    This patch makes sure that agents with resource providers do not use
    the `CheckpointResourcesMessage` protocol. This prevents users from
    running resource provider agents against legacy masters.

    Review: https://reviews.apache.org/r/62974/

> Agent and master can race when updating agent state
> ---------------------------------------------------
>
>                 Key: MESOS-8058
>                 URL: https://issues.apache.org/jira/browse/MESOS-8058
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.5.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>              Labels: mesosphere
>             Fix For: 1.5.0
>
>
> In {{2af9a5b07dc80151154264e974d03f56a1c25838}} we introduced the use of 
> {{UpdateSlaveMessage}} for the agent to inform the master about its current 
> total resources. Currently we trigger this message only on agent registration 
> and reregistration.
> This can race with operations applied in the master and communicated via 
> {{CheckpointResourcesMessage}}.
> Example:
> 1. Agent ({{cpus:4(\*)}}) registers.
> 2. Master is triggered to apply an operation to the agent's resources, e.g., 
> a reservation: {{cpus:4(\*) -> cpus:4(A)}}. The master applies the operation 
> to its current view of the agent's resources and sends the agent a 
> {{CheckpointResourcesMessage}} so the agent can persist the result.
> 3. The agent sends the master an {{UpdateSlaveMessage}}, e.g., {{cpus:4(\*)}} 
> since it hasn't received the {{CheckpointResourcesMessage}} yet.
> 4. The master processes the {{UpdateSlaveMessage}} and updates its view of 
> the agent's resources to be {{cpus:4(\*)}}.
> 5. The agent processes the {{CheckpointResourcesMessage}} and updates its 
> view of its resources to be {{cpus:4(A)}}.
> 6. The agent and the master have an inconsistent view of the agent's 
> resources.
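
The interleaving in steps 1-6 can be reproduced with a small single-threaded model. This is only an illustrative sketch, not Mesos code: the `Master` and `Agent` classes, their method names, the two in-flight message queues, and the string encoding of resources are all assumptions made for this example.

```python
from collections import deque


class Master:
    """Hypothetical stand-in for the master's view of one agent."""

    def __init__(self):
        self.agent_resources = None

    def on_register(self, resources):
        self.agent_resources = resources

    def apply_operation(self, new_resources, to_agent):
        # Apply the operation to the master's own view first, then ask the
        # agent to persist the result (modelled as a queued message).
        self.agent_resources = new_resources
        to_agent.append(("CheckpointResourcesMessage", new_resources))

    def on_update_slave(self, total):
        # The master trusts the agent's reported total resources.
        self.agent_resources = total


class Agent:
    """Hypothetical stand-in for the agent's own resource state."""

    def __init__(self, resources):
        self.resources = resources

    def send_update(self, to_master):
        to_master.append(("UpdateSlaveMessage", self.resources))

    def on_checkpoint(self, resources):
        self.resources = resources


def run_race():
    # Messages in flight in each direction.
    to_agent, to_master = deque(), deque()

    agent = Agent("cpus:4(*)")
    master = Master()
    master.on_register(agent.resources)            # step 1

    master.apply_operation("cpus:4(A)", to_agent)  # step 2
    agent.send_update(to_master)                   # step 3: checkpoint not yet seen

    _, total = to_master.popleft()
    master.on_update_slave(total)                  # step 4

    _, checkpointed = to_agent.popleft()
    agent.on_checkpoint(checkpointed)              # step 5

    return master.agent_resources, agent.resources  # step 6


if __name__ == "__main__":
    master_view, agent_view = run_race()
    print("master:", master_view)
    print("agent: ", agent_view)
```

In this model the two views diverge because each side overwrites its state with a message that was composed before the other side's change became visible; neither message carries enough information to detect that it is stale.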



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
