Re: Review Request 63001: Updated protobuf definitions related to offer operations.

Greg Mann Thu, 26 Oct 2017 15:34:21 -0700


> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote:
> > src/messages/messages.proto
> > Lines 637-638 (patched)
> > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637>
> >
> >     I think that this list should include pending operations, but should 
> > _not_ include terminal operations with unacknowledged updates.
> >     
> >     When the master receives an update, it needs to decide whether or not 
> > it should update the allocator. If the operation was pending when 
> > `UpdateSlaveMessage` was sent, then the master should update the allocator. 
> > If the operation was not pending but its update had not been acknowledged, 
> > then the result of that operation was already reflected in the last 
> > `UpdateSlaveMessage` from the agent, and the master should not update the 
> > allocator.
> >     
> >     If the agent includes only pending operations in this field, then the 
> > master will be able to make this distinction.
> 
> Jie Yu wrote:
>     The reason we need to send terminal but unacked operations is for 
> reconciliation. Master always keeps a cache about all the known non-completed 
> operations (either pending or terminal but unacked). This information can be 
> used to answer reconciliation requests.
>     
>     The master can distinguish between a pending operation vs. a terminal but 
> unacked operation by looking at latest state of the operation. The master 
> update the allocator if the latest state if terminal, irrespective whether 
> it's acked or not.
>     
>     That does bring up an issue. Consider the following two cases:
>     
>     1) Master knows about an operation that the agent does not know about. 
> This can happen if the operation gets lost on its way to the agent. This will 
> trigger an agent re-registration. The master in this case will need to 
> generate a reconcile message (similar to that in `reconcileKnownSlave`) so 
> that the agent or LRP can reply with a terminal status update. This will 
> guarantee that the resources allocated for this operation are properly 
> recovered to allocator.
>     
>     This case is also possible if there is a speculation failure on old 
> operations and result in an `UpdateSlaveMessage` being sent to the master. 
> The operation that master knows about (but agent does not) will eventually 
> reach agent or LRP, and will be rejected by the LRP or the agent. The master 
> will properly recovered the resources to the allocator when a failed update 
> is received.
>     
>     2) Agent knows about an operation that the master does not know about. 
> For instance, This is a new master that just fails over, and an LRP 
> re-registers with the agent. In that case, the master needs to update the 
> 'allocated' (i.e., 'used') resources for those new operations. This is 
> similar to what we do when slave registers. However, the tricky part is that 
> the allocator does not have an interface currently allowing us to add more 
> allocation to a given agent. We likely need to add that (or make it part of 
> `Allocator::updateSlave` interface).
> 
> Greg Mann wrote:
>     OK, I see why we need to include unacked updates as well. Thanks Jie!

Hey Jie, I have another question related to this issue so I re-opened it for 
discussion.

Previously you wrote "The master update the allocator if the latest state if 
terminal, irrespective whether it's acked or not."
This implies that subsequent retries of the update will _not_ have the 
`latest_state` set; otherwise, the master would not be able to distinguish 
between the first terminal status update and subsequent retries. Was it your 
intention to suggest that behavior? Looking at the `Slave::forward` helper that 
we use when intitializing the status update manager, it looks like we _always_ 
set `latest_state` for task status updates: 
https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/slave/slave.cpp#L4867-L4870

I suppose we could set the latest state the first time an operation update is 
sent to master, and then _not_ set it in subsequent retries. Then the value of 
`.has_latest_state()` would be a direct indication of whether or not an update 
is a retry.

- Greg

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63001/#review188162
-----------------------------------------------------------

On Oct. 19, 2017, 12:08 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63001/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2017, 12:08 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, 
> Greg Mann, Jan Schlicht, and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Updated protobuf definitions related to offer operations.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 859fdff4d9a0604bc506b08af79075084ae23466 
>   include/mesos/resource_provider/resource_provider.proto 
> f5a9073075327019fd133bd51265f695ef464845 
>   include/mesos/v1/mesos.proto cfd4abd3af1d8c9fbd31659161eada9ec9f92282 
>   include/mesos/v1/resource_provider/resource_provider.proto 
> e5cbede5b6e57a8641fca1ebfee5454f292cc24c 
>   src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f 
>   src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 
>   src/resource_provider/validation.cpp 
> d2927227f60ab0d4ae2481ad73a31ee444b48ee0 
>   src/tests/resource_provider_validation_tests.cpp 
> f182bff4670318e9de22c2915c5dbb423a74ad6c 
> 
> 
> Diff: https://reviews.apache.org/r/63001/diff/9/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Re: Review Request 63001: Updated protobuf definitions related to offer operations.

Reply via email to