[ 
https://issues.apache.org/jira/browse/MESOS-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794823#comment-16794823
 ] 

Benjamin Bannier commented on MESOS-9313:
-----------------------------------------

I would argue that since a framework _always_ needs to anticipate failures 
(e.g., temporary or permanent agent node failures, or messages getting lost 
in failovers), knowing how Mesos implements particular offer operations 
internally should be inconsequential to the framework implementation. A 
speculation failure (e.g., an agent failing to checkpoint a speculatively 
applied {{RESERVE}} operation) should look no different to the framework 
than, say, an operation getting lost in an agent failover while it is in 
flight. In both cases the framework would use the same approach to reconcile 
agent state.
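
To illustrate, here is a minimal sketch of a framework-side tracker that 
sends both kinds of failure down the same reconciliation path. All names 
({{Outcome}}, {{PendingOperation}}, {{Framework}}, {{on_feedback}}) are 
hypothetical and only loosely modeled on operation feedback; this is not the 
Mesos scheduler API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical outcomes a framework might observe for an operation."""
    FINISHED = auto()  # the agent confirmed the operation
    FAILED = auto()    # e.g., agent could not checkpoint a speculated RESERVE
    DROPPED = auto()   # e.g., operation lost in an agent failover


@dataclass
class PendingOperation:
    operation_id: str
    kind: str  # e.g., "RESERVE"


@dataclass
class Framework:
    # Operations submitted but not yet resolved, keyed by operation ID.
    pending: dict = field(default_factory=dict)
    # Operations whose agent state must be reconciled before retrying.
    to_reconcile: list = field(default_factory=list)

    def submit(self, op: PendingOperation) -> None:
        self.pending[op.operation_id] = op

    def on_feedback(self, operation_id: str, outcome: Outcome) -> None:
        op = self.pending.pop(operation_id, None)
        if op is None:
            return  # unknown or already resolved operation; ignore
        if outcome is not Outcome.FINISHED:
            # A speculation failure and a lost operation take the same path:
            # the framework does not care why the operation did not stick.
            self.to_reconcile.append(op)


fw = Framework()
fw.submit(PendingOperation("op-1", "RESERVE"))
fw.submit(PendingOperation("op-2", "RESERVE"))
fw.on_feedback("op-1", Outcome.FAILED)   # checkpointing failed on the agent
fw.on_feedback("op-2", Outcome.DROPPED)  # lost during agent failover
```

After both callbacks, {{fw.to_reconcile}} holds both operations and 
{{fw.pending}} is empty, so the framework needs no branch that depends on 
whether the master applied the operation speculatively.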

Note that before the introduction of the non-speculative operations 
{{CREATE_DISK}}, {{CREATE_VOLUME}} (and their {{DESTROY}} counterparts), _all 
operations_ were applied speculatively.

> Document speculative offer operation semantics for framework writers.
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9313
>                 URL: https://issues.apache.org/jira/browse/MESOS-9313
>             Project: Mesos
>          Issue Type: Documentation
>          Components: documentation
>            Reporter: James DeFelice
>            Priority: Major
>              Labels: mesosphere, operation-feedback, operations
>
> It recently came to my attention that a subset of offer operations (e.g. 
> RESERVE, UNRESERVE) is implemented speculatively within the Mesos master, 
> meaning that the master will apply the resource conversion internally 
> **before** the conversion is checkpointed on the agent. The master may then 
> re-offer the converted resource to a framework -- even though the agent may 
> still not have checkpointed the resource conversion. If the checkpointing 
> process on the agent fails, then subsequent operations issued for the 
> falsely-offered resource will fail, because the master essentially "lied" to 
> the framework about the true state of the supposedly-converted resource.
> It's also been explained to me that this case is expected to be rare. 
> However, it *can* impact the design/implementation of framework state 
> machines, and so it's critical that this information be documented clearly, 
> outside of the C++ code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
