> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote:
> > src/messages/messages.proto
> > Lines 637-638 (patched)
> > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637>
> >
> >     I think that this list should include pending operations, but should 
> > _not_ include terminal operations with unacknowledged updates.
> >     
> >     When the master receives an update, it needs to decide whether or not 
> > it should update the allocator. If the operation was pending when 
> > `UpdateSlaveMessage` was sent, then the master should update the allocator. 
> > If the operation was not pending but its update had not been acknowledged, 
> > then the result of that operation was already reflected in the last 
> > `UpdateSlaveMessage` from the agent, and the master should not update the 
> > allocator.
> >     
> >     If the agent includes only pending operations in this field, then the 
> > master will be able to make this distinction.

The reason we need to send terminal but unacked operations is for 
reconciliation. Master always keeps a cache about all the known non-completed 
operations (either pending or terminal but unacked). This information can be 
used to answer reconciliation requests.

The master can distinguish between a pending operation vs. a terminal but 
unacked operation by looking at latest state of the operation. The master 
update the allocator if the latest state if terminal, irrespective whether it's 
acked or not.

That does bring up an issue. Consider the following two cases:

1) Master knows about an operation that the agent does not know about. This can 
happen if the operation gets lost on its way to the agent. This will trigger an 
agent re-registration. The master in this case will need to generate a 
reconcile message (similar to that in `reconcileKnownSlave`) so that the agent 
or LRP can reply with a terminal status update. This will guarantee that the 
resources allocated for this operation are properly recovered to allocator.

This case is also possible if there is a speculation failure on old operations 
and result in an `UpdateSlaveMessage` being sent to the master. The operation 
that master knows about (but agent does not) will eventually reach agent or 
LRP, and will be rejected by the LRP or the agent. The master will properly 
recovered the resources to the allocator when a failed update is received.

2) Agent knows about an operation that the master does not know about. For 
instance, This is a new master that just fails over, and an LRP re-registers 
with the agent. In that case, the master needs to update the 'allocated' (i.e., 
'used') resources for those new operations. This is similar to what we do when 
slave registers. However, the tricky part is that the allocator does not have 
an interface currently allowing us to add more allocation to a given agent. We 
likely need to add that (or make it part of `Allocator::updateSlave` interface).


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63001/#review188162
-----------------------------------------------------------


On Oct. 16, 2017, 6:32 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63001/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2017, 6:32 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, 
> Greg Mann, Jan Schlicht, and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Updated protobuf definitions related to offer operations.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto ba87339dbe341f4d16ceea74adc09647a3c07f32 
>   include/mesos/resource_provider/resource_provider.proto 
> f5a9073075327019fd133bd51265f695ef464845 
>   include/mesos/v1/mesos.proto a6d662fb26aa4f78ef20ffe6e013f7a45f7f8c21 
>   include/mesos/v1/resource_provider/resource_provider.proto 
> e5cbede5b6e57a8641fca1ebfee5454f292cc24c 
>   src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f 
>   src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 
>   src/resource_provider/validation.cpp 
> d2927227f60ab0d4ae2481ad73a31ee444b48ee0 
>   src/tests/resource_provider_validation_tests.cpp 
> f182bff4670318e9de22c2915c5dbb423a74ad6c 
> 
> 
> Diff: https://reviews.apache.org/r/63001/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Reply via email to